Monday, January 17, 2022

[FIXED] Perform addition calculation in a column based on another (grouping by a specific column)

January 17, 2022 numpy, pandas, python No comments

Issue

I have a dataframe, df, where I would like to perform an addition based on values in a different column. The addition calculation is based on the id value.

Data

id  date    value   start   status
aa  Q1.22   hello   80  
aa  Q2.22   hi     -200     2000
aa  Q3.22   hey    -400 
aa  Q4.22   sure   -500 
bb  Q3.23   ok      30  
bb  Q4.23   sure    10      100

Desired

id  date    value   start   status  update
aa  Q1.22   hello   80      
aa  Q2.22   hi     -200     2000    1800
aa  Q3.22   hey    -400             1400
aa  Q4.22   sure   -500             900
bb  Q3.23   ok      30      
bb  Q4.23   sure    10      100     110

Logic

The addition will begin at the first display of a single value. For example we see that the number 2000 is a single non consecutive value. The start point of the addition calculation will begin here. Similar to the 100 value (it is an ending value)

We are just adding the values in the 'start' column to the value in the 'status' column. The adding ends with the id. The adding calculation is grouped by id.

Doing

df = df.groupby(["id"]).sum()

df['update'] = df.groupby(["id"]).sum()['start']

However,

I am still researching this, any suggestion is appreciated.

Solution

to_numeric to convert status to numeric with NaN values
groupby ffill the status column this will fill move numeric values forward (leaving starting NaN as NaN
Series.mask to remove unwanted values from start then groupby cumsum
add to the filled status columns.

df['status'] = pd.to_numeric(df['status'], errors='coerce')
status = df.groupby('id')['status'].ffill()
df['update'] = (
        status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()
)

df:

   id   date  value  start  status  update
0  aa  Q1.22  hello     80     NaN     NaN
1  aa  Q2.22     hi   -200  2000.0  1800.0
2  aa  Q3.22    hey   -400     NaN  1400.0
3  aa  Q4.22   sure   -500     NaN   900.0
4  bb  Q3.23     ok     30     NaN     NaN
5  bb  Q4.23   sure     10   100.0   110.0

Optionally, DataFrame.fillna to add back the empty strings:

df = df.fillna('')

df:

   id   date  value  start  status  update
0  aa  Q1.22  hello     80                
1  aa  Q2.22     hi   -200  2000.0  1800.0
2  aa  Q3.22    hey   -400          1400.0
3  aa  Q4.22   sure   -500           900.0
4  bb  Q3.23     ok     30                
5  bb  Q4.23   sure     10   100.0   110.0

Breakdown of Steps:

(Step 2 Forward Fill status in groups)

df.groupby('id')['status'].ffill()

0       NaN
1    2000.0
2    2000.0
3    2000.0
4       NaN
5     100.0
Name: status, dtype: float64

(Step 3.1 Remove unwanted values from start)

df['start'].mask(status.isna())

0      NaN
1   -200.0
2   -400.0
3   -500.0
4      NaN
5     10.0
Name: start, dtype: float64

(Step 3.2 cumsum in groups the kept values)

df['start'].mask(status.isna()).groupby(df['id']).cumsum()

0       NaN
1    -200.0
2    -600.0
3   -1100.0
4       NaN
5      10.0
Name: start, dtype: float64

(Step 4 add to forward filled status)

status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()

0       NaN
1    1800.0
2    1400.0
3     900.0
4       NaN
5     110.0
dtype: float64

Answered By - Henry Ecker

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 17, 2022

[FIXED] Perform addition calculation in a column based on another (grouping by a specific column)

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels