Issue
I have a function in which I would like to calculate the weighted average of the column other_col (weighted by the column amount). Outside of a function this works, but inside one I am not sure how to pass the DataFrame. I'm also getting an error: NameError: name 'df1' is not defined.
import numpy as np
import pandas as pd

def weighted_mean(x):
    try:
        # df1 is not defined in this scope, hence the NameError
        return np.average(x, weights=df1.loc[x.index, 'amount']) > 0.5
    except ZeroDivisionError:
        return 0

def some_function(df1=None):
    df1 = df1.groupby('id').agg(xx=('amount', lambda x: x.sum() > 100),
                                yy=('other_col', weighted_mean)).reset_index()
    return df1

df2 = pd.DataFrame({'id':[1,1,2,2,3], 'amount':[10, 200, 1, 10, 150], 'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
df2 = some_function(df1=df2)
The result I would like to get is:
   id     xx     yy
0   1   True   True
1   2  False  False
2   3   True  False
Solution
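Your fundamental issue is that you are using groupby.agg with a function that relies on multiple columns. agg passes each selected column to the function as a standalone Series, so the function never sees amount. The only way around this is a side effect: the function reaches for a DataFrame defined outside its own scope, which means it is hardcoded to that one DataFrame and cannot be generic: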
# the function is hardcoded to use df2
# this makes it non generic
def weighted_mean(x):
    try:
        return np.average(x, weights=df2.loc[x.index, 'amount']) > 0.5
    except ZeroDivisionError:
        return 0
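To see what agg actually hands to the aggregation function, it can help to record the arguments it receives. The following is a minimal sketch, not part of the original answer; the names spy and seen are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                   'amount': [10, 200, 1, 10, 150],
                   'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

seen = []  # hypothetical log of what the aggregation function receives

def spy(x):
    # record the type and the row labels of each argument
    seen.append((type(x).__name__, list(x.index)))
    return x.mean()

out = df.groupby('id').agg(yy=('other_col', spy))

for kind, index in seen:
    print(kind, index)
```

Every call receives a Series holding only the other_col values of one group. The row labels are preserved, which is the only reason the hardcoded version above can look up the matching amount values in a DataFrame from an outer scope.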
Instead, use groupby.apply and rewrite your function to take a DataFrame as input:
import numpy as np
import pandas as pd

def weighted_mean(df):
    # df is the sub-DataFrame of one group, so both columns are available
    try:
        return np.average(df['other_col'], weights=df['amount']) > 0.5
    except ZeroDivisionError:
        return 0

def some_function(df=None):
    def inner(g):
        # one row of results per group
        return pd.Series({
            'xx': g['amount'].sum()>100,
            'yy': weighted_mean(g),
        })
    return (df.groupby('id', as_index=False)
              .apply(inner)
           )

df2 = pd.DataFrame({'id':[1,1,2,2,3],
                    'amount':[10, 200, 1, 10, 150],
                    'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
out = some_function(df=df2)
Alternatively, define weighted_mean as an inner function of some_function, so that it can access df from the enclosing scope:
import numpy as np
import pandas as pd

def some_function(df=None):
    def weighted_mean(x):
        # df comes from the enclosing scope, x.index selects the group's rows
        try:
            return np.average(x, weights=df.loc[x.index, 'amount']) > 0.5
        except ZeroDivisionError:
            return 0
    return (df.groupby('id')
              .agg(xx=('amount', lambda x: x.sum() > 100),
                   yy=('other_col', weighted_mean))
              .reset_index()
           )

df2 = pd.DataFrame({'id':[1,1,2,2,3],
                    'amount':[10, 200, 1, 10, 150],
                    'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
out = some_function(df=df2)
Output:
   id     xx     yy
0   1   True   True
1   2  False  False
2   3   True  False
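A further option, which is not part of the original answer, is to avoid a custom aggregation function altogether: a weighted mean is the sum of value times weight divided by the sum of the weights, and both sums can be computed with a plain groupby sum. This is a sketch under that assumption; the helper column weighted and the parameters threshold and min_amount are hypothetical names introduced here.

```python
import pandas as pd

def some_function(df, threshold=0.5, min_amount=100):
    # per-group sums of the weights and of value * weight
    sums = (df.assign(weighted=df['other_col'] * df['amount'])
              .groupby('id')[['amount', 'weighted']]
              .sum())
    out = pd.DataFrame({
        'xx': sums['amount'] > min_amount,
        # weighted mean = sum(value * weight) / sum(weight)
        'yy': (sums['weighted'] / sums['amount']) > threshold,
    })
    return out.reset_index()

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})
out = some_function(df2)
print(out)
```

If the weights of a group sum to zero, the division yields NaN rather than raising ZeroDivisionError, and the comparison with the threshold then evaluates to False, so no try/except is needed in this variant.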
Answered By - mozway