Issue
I have a dataframe like the one shown below:
import pandas as pd

df = pd.DataFrame({'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
                   'invoice_id':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'purchase_date':['2017-04-03 12:35:00','2017-04-03 12:50:00',
                                    '2018-04-05 12:59:00','2018-05-04 13:14:00',
                                    '2017-05-05 13:37:00','2018-07-06 13:39:00',
                                    '2018-07-08 11:30:00','2017-04-08 16:00:00',
                                    '2019-04-09 22:00:00','2019-04-11 04:00:00',
                                    '2018-04-13 04:30:00','2017-04-14 08:00:00'],
                   'val':[5,5,5,5,1,6,5,5,8,3,4,6],
                   'Prod_id':['A1','A1','C1','A1','E1','Q1','G1','F1','G1','H1','J1','A1']})
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
I would like to do the following:
a) Group the data by subject_id at 3-month intervals (using the purchase_date column).
b) Compute statistics such as mean, sum, nunique and count for each group, based on the other variables such as Prod_id, val, etc.
For example, the earliest purchase date in df is 2017-04-03, so the starting month in the dataset is April. Counting 3 months from April, Apr, May and Jun will be M1, Jul, Aug and Sep will be M2, and so on. We do this to create data at 3-month intervals. Whenever a 3-month interval has no data in between, we fill it with zero (0).
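To make the intervals concrete, here is a small sketch of the windows I have in mind, using pandas Period arithmetic (the names here are just illustrative):
start = pd.Period('2017-04', freq='M') #earliest month in df
for i in range(4):
    months = [str(start + 3*i + j) for j in range(3)]
    print('M%s:' % (i+1), months)
#M1: ['2017-04', '2017-05', '2017-06']
#M2: ['2017-07', '2017-08', '2017-09']
#M3: ['2017-10', '2017-11', '2017-12']
#M4: ['2018-01', '2018-02', '2018-03']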
So, I tried something like the below based on online research:
length_of_purchase_date = 10
date_month_map = {
    str(x)[:length_of_purchase_date]: 'M_%s' % (i+1) for i, x in enumerate(
        sorted(df.reset_index()['purchase_date'].unique(), reverse=True)
    )
}
df.reset_index().groupby(['subject_id',
                          pd.Grouper(key='purchase_date', freq='3M')
                          ]).agg({
    'val': ['sum', 'mean', 'count'],
})
I expect my output to be like the one shown below (shown for subject_id = 1). Please note that I have to do this on big data with millions of rows.
Solution
Use:
import numpy as np

#sort per subject_id, purchase_date
df = df.sort_values(['subject_id','purchase_date'])
#create month groups by converting to monthly periods and subtracting the minimum
per = df['purchase_date'].dt.to_period('m').astype('int')
df['date_group'] = (per.sub(per.min()) // 3 + 1)
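#worked example of the arithmetic above: monthly period ordinals count
#months since 1970-01, so 2017-04 -> 567, 2018-04 -> 579, 2019-04 -> 591,
#giving date_group (579-567)//3 + 1 = 5 and (591-567)//3 + 1 = 9,
#matching the M5 and M9 rows in the output below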
#custom function: return the most common value only if its count is unique and > 1
def f(x):
    #get counts per value
    y = x.value_counts()
    #keep maximal counts and filter out counts equal to 1
    y = y[y.eq(y.max()) & y.ne(1)]
    #return the single top value if exactly one exists, else NaN (ties or empty y)
    return y.index[0] if len(y) == 1 else np.nan
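#e.g. f(pd.Series(['A1','A1','E1'])) -> 'A1' (unique top count of 2),
#f(pd.Series(['C1','A1'])) -> NaN (all counts are 1),
#f(pd.Series(['A1','A1','B1','B1'])) -> NaN (tied maxima)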
#aggregate per subject_id and date_group
df = df.groupby(['subject_id', 'date_group']).agg(max_date=('purchase_date','max'),
                                                  nunique=('Prod_id','nunique'),
                                                  count_prod_id=('Prod_id','count'),
                                                  sum_val=('val','sum'),
                                                  avg_val=('val','mean'),
                                                  min_val=('val','min'),
                                                  max_val=('val','max'),
                                                  Top1st_prod_id=('Prod_id',f))
#dictionary for filling the numeric columns with 0
d = dict.fromkeys(df.columns.difference(['max_date','Top1st_prod_id']), 0)
#add missing date_group rows per subject with reindex
df = (df.reset_index(level=0)
        .groupby('subject_id')
        .apply(lambda x: x.reindex(range(1, x.index.max() + 1)))
        .fillna(d))
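#e.g. subject_id 1 only has date_group values 1, 5 and 6, so reindexing to
#range(1, 7) inserts empty rows for groups 2, 3 and 4; fillna(d) then sets
#the numeric columns to 0 (max_date stays NaT, Top1st_prod_id stays NaN)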
#custom formatting of columns
df['max_date'] = df['max_date'].dt.strftime('%d-%b-%y')
df = df.drop('subject_id', axis=1).reset_index()
df['date_group'] = 'M' + df['date_group'].astype(str)
print(df)
subject_id date_group max_date nunique count_prod_id sum_val \
0 1 M1 05-May-17 2.0 3.0 11.0
1 1 M2 NaN 0.0 0.0 0.0
2 1 M3 NaN 0.0 0.0 0.0
3 1 M4 NaN 0.0 0.0 0.0
4 1 M5 04-May-18 2.0 2.0 10.0
5 1 M6 08-Jul-18 2.0 2.0 11.0
6 2 M1 14-Apr-17 2.0 2.0 11.0
7 2 M2 NaN 0.0 0.0 0.0
8 2 M3 NaN 0.0 0.0 0.0
9 2 M4 NaN 0.0 0.0 0.0
10 2 M5 13-Apr-18 1.0 1.0 4.0
11 2 M6 NaN 0.0 0.0 0.0
12 2 M7 NaN 0.0 0.0 0.0
13 2 M8 NaN 0.0 0.0 0.0
14 2 M9 11-Apr-19 2.0 2.0 11.0
avg_val min_val max_val Top1st_prod_id
0 3.666667 1.0 5.0 A1
1 0.000000 0.0 0.0 NaN
2 0.000000 0.0 0.0 NaN
3 0.000000 0.0 0.0 NaN
4 5.000000 5.0 5.0 NaN
5 5.500000 5.0 6.0 NaN
6 5.500000 5.0 6.0 NaN
7 0.000000 0.0 0.0 NaN
8 0.000000 0.0 0.0 NaN
9 0.000000 0.0 0.0 NaN
10 4.000000 4.0 4.0 NaN
11 0.000000 0.0 0.0 NaN
12 0.000000 0.0 0.0 NaN
13 0.000000 0.0 0.0 NaN
14 5.500000 3.0 8.0 NaN
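If you later need to translate a group label back to its calendar window, the same period arithmetic works in reverse. A minimal sketch (this goes beyond the answer above; start is the minimum month of the raw data, hard-coded here for the sample):
start = pd.Period('2017-04', freq='M')
g = 5
print('M%s covers %s..%s' % (g, start + (g-1)*3, start + (g-1)*3 + 2))
#M5 covers 2018-04..2018-06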
Answered By - jezrael