Issue
Lets's assume I count how many oranges (Orange
) and apples (Apple
) people (id
) eat in a certain time period. I also know if they are young or old (group
). The pandas dataframe would maybe look like this:
df = pd.DataFrame({'id': ['1','2','3','7'],
'group': ['Young', 'Young', 'Old', 'Old'],
'Apple': [7,2,5,4],
'Orange': [3,6,4,4] })
id group Apple Orange
0 1 Young 7 3
1 2 Young 2 6
2 3 Old 5 4
3 7 Old 4 4
We can easily compute means using groupby()
. E. g:
df.Apple.groupby(df.group).mean()
outputs
Old 4.5
Young 4.5
But let's say, I want to find how much the amount of apples and oranges consumed differs to the group mean per individual?
That is, the output should be
df = pd.DataFrame({'id': ['1','2','3','7'],
'group': ['Young', 'Young', 'Old', 'Old'],
'Apple': [7,2,5,4],
'Orange': [3,6,4,4],
'Apple Difference': [2.5, -2.5, 0.5, -0.5] })
id group Apple Orange Apple Difference
0 1 Young 7 3 2.5
1 2 Young 2 6 -2.5
2 3 Old 5 4 0.5
3 7 Old 4 4 -0.5
Is there a way to do this with pandas/numpy? Sorry for the rockie-question Best /R
Solution
You need transform
for mean
with same length
as df
and substract by sub
:
print (df.groupby('group')['Apple'].transform('mean'))
0 4.5
1 4.5
2 4.5
3 4.5
Name: Apple, dtype: float64
df = pd.DataFrame({'id' : ['1','2','3','7'],
'group' : ['Young', 'Young', 'Old', 'Old'],
'Apple' : [7,2,5,4],
'Orange' : [3,6,4,4],
})
df['Apple Difference'] = df['Apple'].sub(df.groupby('group')['Apple'].transform('mean'))
print (df)
Apple Orange group id Apple Difference
0 7 3 Young 1 2.5
1 2 6 Young 2 -2.5
2 5 4 Old 3 0.5
3 4 4 Old 7 -0.5
Answered By - jezrael
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.