Aggregate Df With User Defined Function
Solution 1:
None of the answers here have addressed why this is failing. If you dig into the pandas code, when a UDF is passed to df.agg, a Series object for each column will be passed to the UDF.
In your case, using a dictionary selects a Series object (a column), and the UDF is then passed to that Series object's Series.agg function. Because it is not a known function (like the string 'mean'), it ends up being passed to Series.apply, which maps the function over each value in the Series object. This is the result you are seeing.
Luckily, the passing of the UDF to Series.apply happens in a try/except block. If it fails to work using Series.apply(func), it swaps to passing the Series object to the function via func(Series). You can use this to modify your code to raise an error if the passed object is not a Series or DataFrame.
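import numpy as np
import pandas as pd
def CoV(_s):
    # Reject scalars so that Series.apply fails and pandas falls back to func(Series).
    if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
        raise TypeError()
    return pd.Series({'CoV': np.std(_s) / np.mean(_s)})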
Now passing it to .agg works as you would expect. It is a hacky work-around, but it works.
df.agg({'a': CoV})
# returns:
#             a
# CoV  0.584645
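The fallback described above can be mimicked outside of .agg. What follows is a minimal sketch, not pandas' actual implementation: agg_like is a hypothetical helper that tries Series.apply first and falls back to calling the function on the whole column, and naive_cov and guarded_cov are illustrative names. It assumes Series.apply calls a plain Python function once per element.

```python
import numpy as np
import pandas as pd


def agg_like(column, func):
    """Hypothetical stand-in for the try/except fallback described above."""
    try:
        # First attempt: map func over every value in the column.
        return column.apply(func)
    except (TypeError, ValueError, AttributeError):
        # Fallback: hand the whole column to func in one call.
        return func(column)


def naive_cov(values):
    # Accepts scalars, so the element-wise attempt succeeds (and is useless).
    return np.std(values) / np.mean(values)


def guarded_cov(values):
    # Rejects scalars, which forces the fallback to the whole-column call.
    if not isinstance(values, (pd.Series, pd.DataFrame, np.ndarray)):
        raise TypeError("expected a whole column, not a single value")
    return values.std(ddof=0) / values.mean()


s = pd.Series([1.0, 2.0, 3.0, 4.0])
per_value = agg_like(s, naive_cov)    # one result per element
whole_col = agg_like(s, guarded_cov)  # a single scalar for the column
print(len(per_value), whole_col)
```

The unguarded function quietly produces one value per row, while the guarded one yields a single coefficient of variation for the column.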
EDIT:
To get this to work with other functions, like 'mean', you will have to pass those as UDFs as well, unfortunately. Even worse, the accumulation of the results is different for UDFs than for built-in functions. Pandas simply stacks them horizontally with a hierarchical column index. A simple stack and reset_index fixes this.
def check_input(fn):
    # Wrap fn so it raises on scalars, forcing the func(Series) fallback.
    def wrapper(_s, *args, **kwargs):
        if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
            raise TypeError()
        return fn(_s, *args, **kwargs)
    wrapper.__name__ = fn.__name__
    return wrapper

@check_input
def Mean(_s):
    return pd.Series({'Mean': np.mean(_s)})

@check_input
def CoV(_s):
    return pd.Series({'CoV': np.std(_s) / np.mean(_s)})

df.agg({'a': [CoV, Mean], 'c': Mean}).stack().reset_index(level=-1, drop=True)
# returns:
#              a      c
# CoV   0.584645    NaN
# Mean  0.511350  2.011
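The decorator above copies the function name onto the wrapper by hand. A minimal sketch of the same guard using functools.wraps, which also carries over the docstring, might look like the following. The error message text and the sample column are illustrative assumptions, and np.ndarray is assumed to be the intended array type.

```python
import functools

import numpy as np
import pandas as pd


def check_input(fn):
    """Reject anything that is not a whole column, frame, or array."""
    @functools.wraps(fn)  # copies __name__, __doc__, etc. onto the wrapper
    def wrapper(_s, *args, **kwargs):
        if not isinstance(_s, (pd.Series, pd.DataFrame, np.ndarray)):
            raise TypeError(f"{fn.__name__} expects a Series, DataFrame or ndarray")
        return fn(_s, *args, **kwargs)
    return wrapper


@check_input
def Mean(_s):
    return pd.Series({'Mean': _s.mean()})


column = pd.Series([1.0, 2.0, 3.0])  # illustrative data
result = Mean(column)                # whole column: accepted

try:
    Mean(2.0)                        # single value: rejected
    rejected = False
except TypeError:
    rejected = True
```

Keeping the original name on the wrapper matters here because the function names are what end up as labels in the aggregated output.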
Solution 2:
This will give you the desired result:
df.assign(k=1).groupby('k')['a'].apply(CoV).reset_index(drop=True)
So you assign k just to use it for groupby, and then remove it by resetting and dropping the index.
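As a rough, self-contained illustration of the constant-key trick: the frame below is made-up data rather than the one from the question, and cov is a hypothetical scalar-returning variant of CoV.

```python
import pandas as pd


def cov(values):
    # Population standard deviation divided by the mean; returns a scalar.
    return values.std(ddof=0) / values.mean()


# Hypothetical data, not the frame from the question.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'c': [2.0, 2.0, 4.0, 4.0]})

result = (
    df.assign(k=1)              # constant key: every row lands in one group
      .groupby('k')['a']
      .apply(cov)               # cov receives the whole column 'a' as a Series
      .reset_index(drop=True)   # discard the throwaway key
)
print(result)
```

Because there is only one group, the function is called once with the full column, which sidesteps the element-wise mapping described in Solution 1.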
Solution 3:
Try using .apply():
df.apply(CoV, axis=0)
This also works for me:
test4 = df.agg(CoV, axis=0)
What you'll get is a dataframe with scalar results of the applied function:
            a         b         c
CoV  0.585977  0.584645  0.406688
Then just slice the Series you need.
Assumptions: you want to apply a single custom scalar function (Series to scalar) to different columns, without group-bys.
Edit: If you'd like to combine multiple functions, another thing you can do is return all of their results from your function as a pd.Series. For example, you can rewrite your custom function as:
def myfunc(_s):
    return pd.Series({'mean': _s.mean(),
                      'std': _s.std(),
                      'CoV': np.std(_s) / np.mean(_s)})
Then running this one with .apply() will yield multiple results.
df.apply(myfunc)
will now give:
             a         b         c
mean  0.495922  0.511350  2.011000
std   0.290744  0.299108  0.818259
CoV   0.585977  0.584645  0.406688
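For a version that can be run end to end, here is a small sketch on made-up data. The frame and the name summary are assumptions for illustration, and the degrees of freedom are spelled out explicitly instead of relying on defaults.

```python
import pandas as pd


def summary(column):
    # Each key becomes a row label in the result of DataFrame.apply.
    return pd.Series({
        'mean': column.mean(),
        'std': column.std(),                        # ddof=1 (pandas default)
        'CoV': column.std(ddof=0) / column.mean(),  # ddof=0, as np.std uses by default
    })


# Hypothetical data, not the frame from the question.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [2.0, 4.0, 6.0, 8.0]})

stats = df.apply(summary)         # rows: mean/std/CoV, columns: a/b
cov_of_a = stats.loc['CoV', 'a']  # slice out the single value you need
print(stats)
```

Note that Series.std defaults to ddof=1 while np.std defaults to ddof=0, so dividing the 'std' row by the 'mean' row will not exactly reproduce the 'CoV' row.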
See more here: Pandas how to apply multiple functions to dataframe