Issue
I am having trouble making the transition from using R data.table to using Pandas for data munging.
Specifically, I am trying to assign the results of aggregations back into the original df as a new column. Note that the aggregations are functions of two columns, so I don't think df.transform()
is the right approach.
To illustrate, I'm trying to replicate what I would do in R with:
library(data.table)
df = as.data.table(read.csv(text=
"id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108"))
df[term == 'qtr' , `:=`(vwap_ish = sum(hours * price),
avg_id = mean(id) ),
.(node, term)]
df
# id term node hours price vwap_ish avg_id
# 1: 1 qtr A 300 107 90600 2
# 2: 2 qtr A 300 104 90600 2
# 3: 3 qtr A 300 91 90600 2
# 4: 4 qtr B 300 89 95400 5
# 5: 5 qtr B 300 113 95400 5
# 6: 6 qtr B 300 116 95400 5
# 7: 7 mth A 50 110 NA NA
# 8: 8 mth A 100 119 NA NA
# 9: 9 mth A 150 99 NA NA
# 10: 10 mth B 50 111 NA NA
# 11: 11 mth B 100 106 NA NA
# 12: 12 mth B 150 108 NA NA
Using Pandas, I can create an object from df that contains all rows of the original df, with the aggregations
import io
import numpy as np
import pandas as pd
data = io.StringIO("""id,term,node,hours,price
1,qtr,A,300,107
2,qtr,A,300,104
3,qtr,A,300,91
4,qtr,B,300,89
5,qtr,B,300,113
6,qtr,B,300,116
7,mth,A,50,110
8,mth,A,100,119
9,mth,A,150,99
10,mth,B,50,111
11,mth,B,100,106
12,mth,B,150,108""")
df = pd.read_csv(data)
df1 = df.groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df1
"""
id term node hours price vwap_ish avg_id
node term
ben mth 9 10 mth B 50 111 32350 10.0
10 11 mth B 100 106 32350 10.0
11 12 mth B 150 108 32350 10.0
qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
ota mth 6 7 mth A 50 110 32250 7.0
7 8 mth A 100 119 32250 7.0
8 9 mth A 150 99 32250 7.0
qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
"""
This doesn't really get me what I want because a) it re-orders and creates indices, and b) it has calculated the aggregation for all rows.
I can get the subset easily enough with
df2 = df[df.term=='qtr'].groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
df2
"""
id term node hours price vwap_ish avg_id
node term
A qtr 0 1 qtr A 300 107 90600 1.0
1 2 qtr A 300 104 90600 1.0
2 3 qtr A 300 91 90600 1.0
B qtr 3 4 qtr B 300 89 95400 4.0
4 5 qtr B 300 113 95400 4.0
5 6 qtr B 300 116 95400 4.0
"""
but I can't get the values in the new columns (vwap_ish, avg_id) back into the old df.
I've tried:
df[df.term=='qtr'] = df[df.term == 'qtr'].groupby(['node','term']).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
And also a few variations of .merge and .join. For example:
df.merge(df2, how='left')
ValueError: 'term' is both an index level and a column label, which is ambiguous.
and
df.merge(df2, how='left', on=df.columns)
KeyError: Index(['id', 'term', 'node', 'hours', 'price'], dtype='object')
In writing this I realised I could take my first approach and then just do
df[df.term=='qtr', ['vwap_ish','avg_id']] = NaN
but this seems quite hacky. It means I have to use a new column, rather than overwriting an existing one on the filtered rows, and if the aggregation function were to break say if term='mth' then that would be problematic too.
I'd really appreciate any help with this as it's been a very steep learning curve to try to make the transition from data.table to Pandas and there's so much I would do in a one-liner that is taking me hours to figure out.
Solution
You can add group_keys=False
parameter for remove MultiIndex
, so left join working well:
df2 = df[df.term == 'qtr'].groupby(['node','term'], group_keys=False).apply(
lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
)
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
Solution without left join:
m = df.term == 'qtr'
df.loc[m, ['vwap_ish','avg_id']] = (df[m].groupby(['node','term'], group_keys=False)
.apply(lambda gp: gp.assign(
vwap_ish = (gp.hours * gp.price).sum(),
avg_id = np.mean(gp.id)
)
))
Improved solution with named aggregation and create vwap_ish
column before groupby
can improve performance:
df2 = (df[df.term == 'qtr']
.assign(vwap_ish = lambda x: x.hours * x.price)
.groupby(['node','term'], as_index=False)
.agg(vwap_ish=('vwap_ish','sum'),
avg_id=('id','mean')))
df = df.merge(df2, how='left')
print (df)
id term node hours price vwap_ish avg_id
0 1 qtr A 300 107 90600.0 2.0
1 2 qtr A 300 104 90600.0 2.0
2 3 qtr A 300 91 90600.0 2.0
3 4 qtr B 300 89 95400.0 5.0
4 5 qtr B 300 113 95400.0 5.0
5 6 qtr B 300 116 95400.0 5.0
6 7 mth A 50 110 NaN NaN
7 8 mth A 100 119 NaN NaN
8 9 mth A 150 99 NaN NaN
9 10 mth B 50 111 NaN NaN
10 11 mth B 100 106 NaN NaN
11 12 mth B 150 108 NaN NaN
Answered By - jezrael
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.