Thursday, December 28, 2023

[FIXED] conditional sums for pandas aggregate

Labels: pandas, python

Issue

I recently switched from R to Python and am having trouble getting used to pandas data frames after working with R's data.table. I'd like to take a column of strings, check each entry against a value, and count the matches, broken down by user. So I would like to take this data:

   A_id       B    C
1:   a1    "up"  100
2:   a2  "down"  102
3:   a3    "up"  100
4:   a3    "up"  250
5:   a4  "left"  100
6:   a5 "right"  102

And return:

   A_id_grouped  sum_up  sum_down  ...  over_200_up
1:           a1       1         0  ...            0
2:           a2       0         1  ...            0
3:           a3       2         0  ...            1
4:           a4       0         0  ...            0
5:           a5       0         0  ...            0

Previously I did this in R, using data.table:

DT[, list(sum_up = sum(B == "up"),
          sum_down = sum(B == "down"),
          ...,
          over_200_up = sum(B == "up" & C > 200)),
   by = list(A_id)]

However, all of my recent attempts in Python have failed:

DT.agg({"D": [np.sum(DT[DT["B"]=="up"]),np.sum(DT[DT["B"]=="up"])], ...
    "C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
    })

Thank you in advance! It seems like a simple question, but I couldn't find an answer anywhere.

Solution

This is an old question, but I think a better approach, and one that avoids apply, is to build a new dataframe of boolean columns before grouping and aggregating:


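import pandas as pd

# df is the question's data, with columns A_id, B and C
df = df.set_index('A_id')

# one boolean column per condition; True marks a matching row
outcome = {'sum_up' : df.B.eq('up'),
           'sum_down': df.B.eq('down'),
           'over_200_up' : df.B.eq('up') & df.C.gt(200)}

# summing booleans per A_id counts the True values
outcome = pd.DataFrame(outcome).groupby(level=0).sum()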

outcome
 
      sum_up  sum_down  over_200_up
A_id                               
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
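
For anyone who wants to run the approach above end to end, here is a minimal sketch. It assumes the sample data is typed in from the question's table; the helper name conditional_counts is hypothetical and only wraps the same steps for reuse.

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption:
# the quotes around B are treated as display only, not part of the values).
df = pd.DataFrame({
    "A_id": ["a1", "a2", "a3", "a3", "a4", "a5"],
    "B": ["up", "down", "up", "up", "left", "right"],
    "C": [100, 102, 100, 250, 100, 102],
})


def conditional_counts(frame):
    """Count, per A_id, the rows matching each condition (hypothetical helper)."""
    indexed = frame.set_index("A_id")
    flags = pd.DataFrame({
        "sum_up": indexed["B"].eq("up"),
        "sum_down": indexed["B"].eq("down"),
        "over_200_up": indexed["B"].eq("up") & indexed["C"].gt(200),
    })
    # Summing booleans counts the True values within each A_id group.
    return flags.groupby(level=0).sum()


result = conditional_counts(df)
print(result)
```

Because every condition is just a boolean Series, adding another conditional count only needs one more entry in the dictionary.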

Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:

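# start from the original frame (default integer index, A_id still a column);
# append=True keeps that index so every row stays unique for unstack
(df
  .set_index(['A_id', 'B'], append=True)
  .C
  .unstack('B')                                # one column per value of B
  .assign(gt_200=lambda df: df.up.gt(200))     # True where an "up" row exceeds 200
  .groupby(level='A_id')
  .agg(sum_up=('up', 'count'),                 # count skips the NaN cells
       sum_down=('down', 'count'),
       over_200_up=('gt_200', 'sum'))
)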

      sum_up  sum_down  over_200_up
A_id                               
a1         1         0            0
a2         0         1            0
a3         2         0            1
a4         0         0            0
a5         0         0            0
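
As a sanity check, the two approaches in this answer can be run side by side on the sample data. This is only a sketch: the frame is retyped from the question's table, and the variable names by_flags, by_unstack and same are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question's table (assumed values).
df = pd.DataFrame({
    "A_id": ["a1", "a2", "a3", "a3", "a4", "a5"],
    "B": ["up", "down", "up", "up", "left", "right"],
    "C": [100, 102, 100, 250, 100, 102],
})

# Approach 1: boolean columns, then a grouped sum.
indexed = df.set_index("A_id")
by_flags = pd.DataFrame({
    "sum_up": indexed["B"].eq("up"),
    "sum_down": indexed["B"].eq("down"),
    "over_200_up": indexed["B"].eq("up") & indexed["C"].gt(200),
}).groupby(level=0).sum()

# Approach 2: unstack B into columns, then count the non-null cells.
# This starts from the original df, where A_id is still a column.
by_unstack = (
    df.set_index(["A_id", "B"], append=True)["C"]
    .unstack("B")
    .assign(gt_200=lambda wide: wide["up"].gt(200))
    .groupby(level="A_id")
    .agg(
        sum_up=("up", "count"),
        sum_down=("down", "count"),
        over_200_up=("gt_200", "sum"),
    )
)

# Compare the values and the group labels rather than dtypes or axis names.
same = (
    by_flags.to_numpy().tolist() == by_unstack.to_numpy().tolist()
    and list(by_flags.index) == list(by_unstack.index)
)
print(same)
```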

Answered By - sammywemmy

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
