Issue
I have a dataframe, df, where I would like to perform an addition based on values in a different column. The addition calculation is based on the id value.
Data
id date value start status
aa Q1.22 hello 80
aa Q2.22 hi -200 2000
aa Q3.22 hey -400
aa Q4.22 sure -500
bb Q3.23 ok 30
bb Q4.23 sure 10 100
Desired
id date value start status update
aa Q1.22 hello 80
aa Q2.22 hi -200 2000 1800
aa Q3.22 hey -400 1400
aa Q4.22 sure -500 900
bb Q3.23 ok 30
bb Q4.23 sure 10 100 110
Logic
The addition will begin at the first display of a single value. For example we see that the number 2000 is a single non consecutive value. The start point of the addition calculation will begin here. Similar to the 100 value (it is an ending value)
We are just adding the values in the 'start' column to the value in the 'status' column. The adding ends with the id. The adding calculation is grouped by id.
Doing
df = df.groupby(["id"]).sum()
df['update'] = df.groupby(["id"]).sum()['start']
However,
I am still researching this, any suggestion is appreciated.
Solution
to_numeric
to convertstatus
to numeric withNaN
valuesgroupby ffill
thestatus
column this will fill move numeric values forward (leaving startingNaN
asNaN
Series.mask
to remove unwanted values fromstart
thengroupby cumsum
- add to the filled
status
columns.
df['status'] = pd.to_numeric(df['status'], errors='coerce')
status = df.groupby('id')['status'].ffill()
df['update'] = (
status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()
)
df
:
id date value start status update
0 aa Q1.22 hello 80 NaN NaN
1 aa Q2.22 hi -200 2000.0 1800.0
2 aa Q3.22 hey -400 NaN 1400.0
3 aa Q4.22 sure -500 NaN 900.0
4 bb Q3.23 ok 30 NaN NaN
5 bb Q4.23 sure 10 100.0 110.0
Optionally, DataFrame.fillna
to add back the empty strings:
df = df.fillna('')
df
:
id date value start status update
0 aa Q1.22 hello 80
1 aa Q2.22 hi -200 2000.0 1800.0
2 aa Q3.22 hey -400 1400.0
3 aa Q4.22 sure -500 900.0
4 bb Q3.23 ok 30
5 bb Q4.23 sure 10 100.0 110.0
Breakdown of Steps:
(Step 2 Forward Fill status
in groups)
df.groupby('id')['status'].ffill()
0 NaN
1 2000.0
2 2000.0
3 2000.0
4 NaN
5 100.0
Name: status, dtype: float64
(Step 3.1 Remove unwanted values from start
)
df['start'].mask(status.isna())
0 NaN
1 -200.0
2 -400.0
3 -500.0
4 NaN
5 10.0
Name: start, dtype: float64
(Step 3.2 cumsum
in groups the kept values)
df['start'].mask(status.isna()).groupby(df['id']).cumsum()
0 NaN
1 -200.0
2 -600.0
3 -1100.0
4 NaN
5 10.0
Name: start, dtype: float64
(Step 4 add to forward filled status
)
status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()
0 NaN
1 1800.0
2 1400.0
3 900.0
4 NaN
5 110.0
dtype: float64
Answered By - Henry Ecker
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.