
Aggregate Df With User Defined Function

I have a question regarding aggregating pandas DataFrames with user-defined functions. If I have a DataFrame and run agg with or without groupby, the result is aggregated when built

Solution 1:

None of the answers here have addressed why this is failing. If you dig into the pandas code, when a UDF is passed to df.agg, a Series object for each column will be passed to the UDF.

In your case, using a dictionary selects a Series object (a column) and the UDF is then passed to the Series object's Series.agg function. Because it is not a known function (like the string 'mean'), it ends up being passed to Series.apply, which maps the function over each value in the Series object. This is the result you are seeing.
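To see the element-wise behaviour in isolation, here is a minimal sketch. The sample values and the scalar-returning `cov` helper are assumptions for illustration, not the asker's data. Each element is a single float, so its standard deviation is 0 and the ratio collapses to 0.0 on every row, while calling the function on the whole Series gives one number.

```python
import numpy as np
import pandas as pd

def cov(x):
    # Coefficient of variation: population std divided by mean.
    return np.std(x) / np.mean(x)

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative data only

# Series.apply calls cov once per element, so each call sees a single float.
elementwise = s.apply(cov)

# Calling cov on the whole Series gives the single number we actually want.
direct = cov(s)

print(elementwise.tolist())
print(direct)
```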

Luckily, the passing of the UDF to Series.apply happens in a try/except block. If Series.apply(func) fails, pandas falls back to passing the whole Series object to the function via func(Series). You can exploit this by modifying your function to raise an error when the passed object is not a Series or DataFrame, which forces the fallback.

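import numpy as np
import pandas as pd

def CoV(_s):
    # Raise on scalars so that Series.apply(CoV) fails and pandas retries with CoV(Series).
    if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
        raise TypeError()
    return pd.Series({'CoV' : np.std(_s)/np.mean(_s)})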

Now passing it to .agg works as you would expect. It is a hacky work-around, but it works.

df.agg({'a': CoV})
# returns:
            a
CoV  0.584645
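The fallback described above can be sketched roughly as follows. `agg_like` is a hypothetical stand-in written for this example, not pandas code, and the exact tuple of exceptions pandas catches is an assumption that may differ between versions. The point is only that a function which raises on scalar input skips the element-wise path.

```python
import numpy as np
import pandas as pd

def agg_like(series, func):
    # Hypothetical stand-in for the fallback; not the pandas implementation.
    try:
        return series.apply(func)      # element-wise attempt first
    except (ValueError, AttributeError, TypeError):
        return func(series)            # fall back to the whole Series

def plain_cov(_s):
    return np.std(_s) / np.mean(_s)

def guarded_cov(_s):
    # Reject scalars so the element-wise attempt fails on the first element.
    if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
        raise TypeError("expected a Series, DataFrame or ndarray")
    return np.std(_s) / np.mean(_s)

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative data only

mapped = agg_like(s, plain_cov)      # no error raised: one value per element
reduced = agg_like(s, guarded_cov)   # TypeError is caught: a single scalar

print(type(mapped).__name__, len(mapped))
print(reduced)
```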

EDIT:

To get this to work with other functions, like 'mean', you will have to pass those as UDFs as well, unfortunately. Even worse, the accumulation of the results is different for UDFs than for built-in functions. Pandas simply stacks them horizontally with a hierarchical column index. A simple stack and reset_index fixes this.

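def check_input(fn):
    # Decorator: reject scalar input so pandas falls back to fn(Series).
    def wrapper(_s, *args, **kwargs):
        if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
            raise TypeError()
        return fn(_s, *args, **kwargs)
    wrapper.__name__ = fn.__name__  # keep the original function name on the wrapper
    return wrapper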

@check_input
def Mean(_s):
    return pd.Series({'Mean': np.mean(_s)})

@check_input
def CoV(_s):
    return pd.Series({'CoV' : np.std(_s)/np.mean(_s)})

df.agg({'a': [CoV, Mean], 'c': Mean}).stack().reset_index(level=-1, drop=True)
# returns:
             a      c
CoV   0.584645    NaN
Mean  0.511350  2.011

Solution 2:

This will give you the desired result:

df.assign(k=1).groupby('k')['a'].apply(CoV).reset_index(drop=True)

Here k is assigned only so that there is something to group by; it is then removed by resetting the index with drop=True.
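A small self-contained sketch of the same trick, assuming a scalar-returning `cov` and made-up data (both are illustrative, not the asker's originals). Because every row shares the same key, groupby forms a single group and the function receives the whole column.

```python
import numpy as np
import pandas as pd

def cov(_s):
    # Assumed scalar-returning coefficient of variation.
    return np.std(_s) / np.mean(_s)

# Illustrative data only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "c": [2.0, 2.0, 3.0, 1.0]})

# Every row gets k == 1, so there is exactly one group holding all of column a.
result = df.assign(k=1).groupby("k")["a"].apply(cov).reset_index(drop=True)

print(result)
```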


Solution 3:

Try using .apply(): df.apply(CoV, axis=0)

This also works for me: test4 = df.agg(CoV, axis=0)

What you'll get is a dataframe with scalar results of the applied function:

            a         b         c
CoV  0.585977  0.584645  0.406688

Then just slice the Series you need.

Assumptions: You want to apply a single custom scalar function (Series to scalar) to different columns without group-bys.
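As a rough sketch of that case, with made-up data and a scalar-returning `cov` (both assumptions for illustration): `df.apply` hands each column to the function as a Series. With a scalar return value the result is a Series indexed by column name; a one-row DataFrame like the one above appears when the function returns a pd.Series instead.

```python
import numpy as np
import pandas as pd

def cov(_s):
    # Assumed scalar-returning coefficient of variation.
    return np.std(_s) / np.mean(_s)

# Illustrative data only; column b is column a scaled by 2.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})

per_column = df.apply(cov, axis=0)  # one value per column
a_value = per_column["a"]           # pick out the column you need

print(per_column)
```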

Edit: If you'd like to combine multiple functions, another thing you can do is to present all of them as output of your function (which returns a pd.Series). For example you can rewrite your custom function as:

def myfunc(_s):
    return pd.Series({'mean': _s.mean(), 
                       'std': _s.std(), 
                       'CoV' : np.std(_s)/np.mean(_s)})

Then running this one with .apply() will yield multiple results. df.apply(myfunc) will now give:

               a           b           c
mean    0.495922    0.511350    2.011000
std     0.290744    0.299108    0.818259
CoV     0.585977    0.584645    0.406688
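One thing to keep in mind when reading the table: the 'std' and 'CoV' rows use different conventions. `_s.std()` defaults to the sample standard deviation (ddof=1), whereas `np.std(_s)` defaults to the population version (ddof=0), so CoV is not exactly std divided by mean. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # illustrative data only

sample_std = s.std()          # pandas default: ddof=1, divides by n - 1
population_std = np.std(s)    # NumPy default: ddof=0, divides by n

# Passing ddof explicitly makes the two agree.
matched = s.std(ddof=0)

print(sample_std, population_std, matched)
```

If you want the rows to be consistent with each other, pick one convention and pass ddof explicitly in both places.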

See more here: Pandas how to apply multiple functions to dataframe

