Wednesday, December 15, 2021

[FIXED] Calculate Statistical Metrics On Time-Series Data

December 15, 2021 jupyter-notebook, numpy, pandas, python, time-series No comments

Issue

I have a dataframe that looks like this:

        ID          DATE       QTD
0       71896517    2020-07-25  1
1       71896517    2020-09-14  2
2       72837949    2020-09-21  1
3       72848188    2020-11-03  1
4       73307986    2020-11-04  1
5       72848188    2020-11-16  1
6       71896517    2020-11-22  1
7       73307986    2020-11-25  1
8       73307986    2021-01-04  1
9       73307986    2021-02-04  1
10      72848188    2021-02-07  1
11      72837949    2021-02-11  1

I want to add a column to the dataframe for the monthly mean of the QTD column. This metrics should be calculated for each ID individually (so every ID has its own values) and should incremement as the dataframe grows in DATE (which is the year-month-day). If the ID doesn't have activity in a certain month, I want that month to still be included on the calculated metrics.

I want the end dataframe to look like this:

        ID          DATE       QTD    MEAN
0       71896517    2020-07-25  1     1.0
1       71896517    2020-09-14  2     1.0
2       72837949    2020-09-21  1     1.0
3       72848188    2020-11-03  1     1.0
4       73307986    2020-11-04  1     1.0
5       72848188    2020-11-16  1     2.0
6       71896517    2020-11-22  1     0.8
7       73307986    2020-11-25  1     2.0
8       73307986    2021-01-04  1     1.0
9       73307986    2021-02-04  1     1.0
10      72848188    2021-02-07  1     0.75
11      72837949    2021-02-11  1     0.33

How can I go about to implement this?

Solution

There's usually an iterative solution.

import pandas as pd
import numpy as np

def monthdelta(a,b):
    a1,a2,a3 = (int(k) for k in a.split('-'))
    b1,b2,b3 = (int(k) for k in b.split('-'))
    return (a1*12+a2) - (b1*12+b2)

data = [
[ 71896517, "2020-07-25", 1 ],
[ 71896517, "2020-09-14", 2 ],
[ 72837949, "2020-09-21", 1 ],
[ 72848188, "2020-11-03", 1 ],
[ 73307986, "2020-11-04", 1 ],
[ 72848188, "2020-11-16", 1 ],
[ 71896517, "2020-11-22", 1 ],
[ 73307986, "2020-11-25", 1 ],
[ 73307986, "2021-01-04", 1 ],
[ 73307986, "2021-02-04", 1 ],
[ 72848188, "2021-02-07", 1 ],
[ 72837949, "2021-02-11", 1 ],
]

df = pd.DataFrame( data, columns=["ID", "DATE", "QTD"] )

startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []

for row in df.T.iteritems():
    id = row[1]['ID']
    if id not in startdate:
        num[id] = 1
        startdate[id] = row[1]['DATE']
        sums[id] = row[1]['QTD']
        sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
        means.append( row[1]['QTD'] )
        stdev.append( 0 )
    else:
        num[id] += 1
        sums[id] += row[1]['QTD']
        sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
        delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
        means.append( sums[id] / delta )
        if delta == 1:
            stdev.append( 0 )
        else:
            stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))

print( means )
df['MEAN'] = pd.Series(means)
print( stdev )
df['STDEV'] = pd.Series(stdev)
print( df )

Output:

[1, 1.0, 1, 1, 1, 2.0, 0.8, 2.0, 1.0, 1.0, 0.75, 0.3333333333333333]
[0, 1.4142135623730951, 0, 0, 0, 0, 1.6733200530681511, 0, 0.0, 0.0, 0.8660254037844386, 1.1547005383792515]
          ID        DATE  QTD      MEAN     STDEV
0   71896517  2020-07-25    1  1.000000  0.000000
1   71896517  2020-09-14    2  1.000000  1.414214
2   72837949  2020-09-21    1  1.000000  0.000000
3   72848188  2020-11-03    1  1.000000  0.000000
4   73307986  2020-11-04    1  1.000000  0.000000
5   72848188  2020-11-16    1  2.000000  0.000000
6   71896517  2020-11-22    1  0.800000  1.673320
7   73307986  2020-11-25    1  2.000000  0.000000
8   73307986  2021-01-04    1  1.000000  0.000000
9   73307986  2021-02-04    1  1.000000  0.000000
10  72848188  2021-02-07    1  0.750000  0.866025
11  72837949  2021-02-11    1  0.333333  1.154701

Answered By - Tim Roberts

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, December 15, 2021

[FIXED] Calculate Statistical Metrics On Time-Series Data

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels