Issue
I have a dataframe that looks like this:
ID DATE QTD
0 71896517 2020-07-25 1
1 71896517 2020-09-14 2
2 72837949 2020-09-21 1
3 72848188 2020-11-03 1
4 73307986 2020-11-04 1
5 72848188 2020-11-16 1
6 71896517 2020-11-22 1
7 73307986 2020-11-25 1
8 73307986 2021-01-04 1
9 73307986 2021-02-04 1
10 72848188 2021-02-07 1
11 72837949 2021-02-11 1
I want to add a column to the dataframe for the monthly mean of the QTD
column.
This metrics should be calculated for each ID
individually (so every ID
has its own values) and should incremement as the dataframe grows in DATE
(which is the year-month-day).
If the ID
doesn't have activity in a certain month, I want that month to still be included on the calculated metrics.
I want the end dataframe to look like this:
ID DATE QTD MEAN
0 71896517 2020-07-25 1 1.0
1 71896517 2020-09-14 2 1.0
2 72837949 2020-09-21 1 1.0
3 72848188 2020-11-03 1 1.0
4 73307986 2020-11-04 1 1.0
5 72848188 2020-11-16 1 2.0
6 71896517 2020-11-22 1 0.8
7 73307986 2020-11-25 1 2.0
8 73307986 2021-01-04 1 1.0
9 73307986 2021-02-04 1 1.0
10 72848188 2021-02-07 1 0.75
11 72837949 2021-02-11 1 0.33
How can I go about to implement this?
Solution
There's usually an iterative solution.
import pandas as pd
import numpy as np
def monthdelta(a,b):
a1,a2,a3 = (int(k) for k in a.split('-'))
b1,b2,b3 = (int(k) for k in b.split('-'))
return (a1*12+a2) - (b1*12+b2)
data = [
[ 71896517, "2020-07-25", 1 ],
[ 71896517, "2020-09-14", 2 ],
[ 72837949, "2020-09-21", 1 ],
[ 72848188, "2020-11-03", 1 ],
[ 73307986, "2020-11-04", 1 ],
[ 72848188, "2020-11-16", 1 ],
[ 71896517, "2020-11-22", 1 ],
[ 73307986, "2020-11-25", 1 ],
[ 73307986, "2021-01-04", 1 ],
[ 73307986, "2021-02-04", 1 ],
[ 72848188, "2021-02-07", 1 ],
[ 72837949, "2021-02-11", 1 ],
]
df = pd.DataFrame( data, columns=["ID", "DATE", "QTD"] )
startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []
for row in df.T.iteritems():
id = row[1]['ID']
if id not in startdate:
num[id] = 1
startdate[id] = row[1]['DATE']
sums[id] = row[1]['QTD']
sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
means.append( row[1]['QTD'] )
stdev.append( 0 )
else:
num[id] += 1
sums[id] += row[1]['QTD']
sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
means.append( sums[id] / delta )
if delta == 1:
stdev.append( 0 )
else:
stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))
print( means )
df['MEAN'] = pd.Series(means)
print( stdev )
df['STDEV'] = pd.Series(stdev)
print( df )
Output:
[1, 1.0, 1, 1, 1, 2.0, 0.8, 2.0, 1.0, 1.0, 0.75, 0.3333333333333333]
[0, 1.4142135623730951, 0, 0, 0, 0, 1.6733200530681511, 0, 0.0, 0.0, 0.8660254037844386, 1.1547005383792515]
ID DATE QTD MEAN STDEV
0 71896517 2020-07-25 1 1.000000 0.000000
1 71896517 2020-09-14 2 1.000000 1.414214
2 72837949 2020-09-21 1 1.000000 0.000000
3 72848188 2020-11-03 1 1.000000 0.000000
4 73307986 2020-11-04 1 1.000000 0.000000
5 72848188 2020-11-16 1 2.000000 0.000000
6 71896517 2020-11-22 1 0.800000 1.673320
7 73307986 2020-11-25 1 2.000000 0.000000
8 73307986 2021-01-04 1 1.000000 0.000000
9 73307986 2021-02-04 1 1.000000 0.000000
10 72848188 2021-02-07 1 0.750000 0.866025
11 72837949 2021-02-11 1 0.333333 1.154701
Answered By - Tim Roberts
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.