Issue
I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
                   'invoice_id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'purchase_date':['2017-04-03 12:35:00','2017-04-03 12:50:00',
                                    '2018-04-05 12:59:00','2018-05-04 13:14:00',
                                    '2017-05-05 13:37:00','2018-07-06 13:39:00',
                                    '2018-07-08 11:30:00','2017-04-08 16:00:00',
                                    '2019-04-09 22:00:00','2019-04-11 04:00:00',
                                    '2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val':[5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id':['A1','A1','C1','A1','E1','Q1','G1','F1','G1','H1','J1','A1']})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
I would like to do the following:
a) Group the data by subject_id at 3-month intervals (using the purchase_date column).
b) Compute statistics such as mean, sum, nunique and count for each group, based on the other variables such as Prod_id, val, etc.
For example, the earliest purchase date in df is 2017-04-03, so the starting month in the dataset is April. Counting 3 months from April, Apr, May and Jun will be M1, Jul, Aug and Sep will be M2, and so on. We do this to create data at 3-month intervals. Whenever a 3-month interval has no data in between, we fill it with zero (0).
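To make the intervals concrete, here is a small sketch of the windows I have in mind, using pandas Period arithmetic (the names here are just illustrative):
start = pd.Period('2017-04', freq='M') #earliest month in df
for i in range(4):
    months = [str(start + 3*i + j) for j in range(3)]
    print('M%s:' % (i+1), months)
#M1: ['2017-04', '2017-05', '2017-06']
#M2: ['2017-07', '2017-08', '2017-09']
#M3: ['2017-10', '2017-11', '2017-12']
#M4: ['2018-01', '2018-02', '2018-03']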
So, I tried something like the below based on online research:
length_of_purchase_date = 10
date_month_map = {
    str(x)[:length_of_purchase_date]: 'M_%s' % (i+1) for i, x in enumerate(
        sorted(df.reset_index()['purchase_date'].unique(), reverse=True)
    )
}
df.reset_index().groupby(['subject_id',
                          pd.Grouper(key='purchase_date', freq='3M')
                          ]).agg({
    'val': ['sum', 'mean', 'count'],
})
I expect my output to be like the one shown below (shown for subject_id = 1). Please note that I have to do this on big data with millions of rows.
Solution
Use:
import numpy as np

#sort per subject_id, purchase_date
df = df.sort_values(['subject_id','purchase_date'])
#create month groups by converting to monthly periods and subtracting the minimum
per = df['purchase_date'].dt.to_period('m').astype('int')
df['date_group'] = (per.sub(per.min()) // 3 + 1)
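#worked example of the arithmetic above: monthly period ordinals count
#months since 1970-01, so 2017-04 -> 567, 2018-04 -> 579, 2019-04 -> 591,
#giving date_group (579-567)//3 + 1 = 5 and (591-567)//3 + 1 = 9,
#matching the M5 and M9 rows in the output below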
#custom function: return the most common value only if its count is unique and > 1
def f(x):
    #get counts per value
    y = x.value_counts()
    #keep maximal counts and filter out counts equal to 1
    y = y[y.eq(y.max()) & y.ne(1)]
    #return the single top value if exactly one exists, else NaN (ties or empty y)
    return y.index[0] if len(y) == 1 else np.nan
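#e.g. f(pd.Series(['A1','A1','E1'])) -> 'A1' (unique top count of 2),
#f(pd.Series(['C1','A1'])) -> NaN (all counts are 1),
#f(pd.Series(['A1','A1','B1','B1'])) -> NaN (tied maxima)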
#aggregate per subject_id and date_group
df = df.groupby(['subject_id', 'date_group']).agg(max_date=('purchase_date','max'),
                                                  nunique=('Prod_id','nunique'),
                                                  count_prod_id=('Prod_id','count'),
                                                  sum_val=('val','sum'),
                                                  avg_val=('val','mean'),
                                                  min_val=('val','min'),
                                                  max_val=('val','max'),
                                                  Top1st_prod_id=('Prod_id',f))
#dictionary for filling the numeric columns with 0
d = dict.fromkeys(df.columns.difference(['max_date','Top1st_prod_id']), 0)
#add missing date_group rows per subject with reindex
df = (df.reset_index(level=0)
        .groupby('subject_id')
        .apply(lambda x: x.reindex(range(1, x.index.max() + 1)))
        .fillna(d))
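#e.g. subject_id 1 only has date_group values 1, 5 and 6, so reindexing to
#range(1, 7) inserts empty rows for groups 2, 3 and 4; fillna(d) then sets
#the numeric columns to 0 (max_date stays NaT, Top1st_prod_id stays NaN)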
#custom formatting of columns
df['max_date'] = df['max_date'].dt.strftime('%d-%b-%y')
df = df.drop('subject_id', axis=1).reset_index()
df['date_group'] = 'M' + df['date_group'].astype(str)
print(df)
subject_id date_group max_date nunique count_prod_id sum_val \
0 1 M1 05-May-17 2.0 3.0 11.0
1 1 M2 NaN 0.0 0.0 0.0
2 1 M3 NaN 0.0 0.0 0.0
3 1 M4 NaN 0.0 0.0 0.0
4 1 M5 04-May-18 2.0 2.0 10.0
5 1 M6 08-Jul-18 2.0 2.0 11.0
6 2 M1 14-Apr-17 2.0 2.0 11.0
7 2 M2 NaN 0.0 0.0 0.0
8 2 M3 NaN 0.0 0.0 0.0
9 2 M4 NaN 0.0 0.0 0.0
10 2 M5 13-Apr-18 1.0 1.0 4.0
11 2 M6 NaN 0.0 0.0 0.0
12 2 M7 NaN 0.0 0.0 0.0
13 2 M8 NaN 0.0 0.0 0.0
14 2 M9 11-Apr-19 2.0 2.0 11.0
avg_val min_val max_val Top1st_prod_id
0 3.666667 1.0 5.0 A1
1 0.000000 0.0 0.0 NaN
2 0.000000 0.0 0.0 NaN
3 0.000000 0.0 0.0 NaN
4 5.000000 5.0 5.0 NaN
5 5.500000 5.0 6.0 NaN
6 5.500000 5.0 6.0 NaN
7 0.000000 0.0 0.0 NaN
8 0.000000 0.0 0.0 NaN
9 0.000000 0.0 0.0 NaN
10 4.000000 4.0 4.0 NaN
11 0.000000 0.0 0.0 NaN
12 0.000000 0.0 0.0 NaN
13 0.000000 0.0 0.0 NaN
14 5.500000 3.0 8.0 NaN
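If you later need to translate a group label back to its calendar window, the same period arithmetic works in reverse. A minimal sketch (this goes beyond the answer above; start is the minimum month of the raw data, hard-coded here for the sample):
start = pd.Period('2017-04', freq='M')
g = 5
print('M%s covers %s..%s' % (g, start + (g-1)*3, start + (g-1)*3 + 2))
#M5 covers 2018-04..2018-06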
Answered By - jezrael