Sunday, November 7, 2021

[FIXED] How to perform a Window Function operation in a Pandas Dataframe using a cumulative sum?

November 07, 2021 apache-spark, dataframe, group-by, pandas, pyspark No comments

Issue

I have an initial dataframe

df1 = 
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
+---+---+---+

Using pyspark I coded the dataframe with a window function using a sum taking into account the column 'A' and taking into account the column 'B' sorted.

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D',sum('C').over(spec))

df1.show()

+---+---+---+-----+
|  A|  B|  C|    D|
+---+---+---+-----+
|  1|  1| 10| 10.0|
|  1|  2| 11| 33.0|
|  1|  2| 12| 33.0|
|  2|  1| 14| 45.0|
|  2|  1| 15| 45.0|
|  2|  1| 16| 45.0|
|  3|  1| 13| 13.0|
|  4|  1| 17| 17.0|
|  4|  2| 18| 35.0|
|  4|  3| 19| 54.0|
|  4|  4| 19| 73.0|
|  4|  5| 20|113.0|
|  4|  5| 20|113.0|
+---+---+---+-----+

Is it possible to do the same calculation using Pandas Dataframe?

I tried using

df['D'] = df.sort_values(['A','B']).groupby(['A', 'B'])['C'].transform('cumsum')

but it is not the same result

Thanks

Solution

In pandas we can groupby sum on A and B. Then groupby cumsum just over A. To add results back to the DataFrame rename to the new column name then join the results back to the DataFrame on the initial group keys ['A', 'B']:

df1 = df1.join(
    df1.groupby(by=['A', 'B'])['C'].sum()
        .groupby(level='A').cumsum()
        .rename('D'),
    on=['A', 'B']
)

df1:

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113

Setup:

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})

Answered By - Henry Ecker

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, November 7, 2021

[FIXED] How to perform a Window Function operation in a Pandas Dataframe using a cumulative sum?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels