Issue
I'm struggling to find an elegant solution to create a running total column in my dataframe. It should start the count if two criteria are met and reset any time they aren't.
If the user in the example frame below is the same as the row before and the 'Value Col' is 0, the running total should start and increase by one for every row until either the user changes or the Value Col is NOT 0.
This is being run on a very large dataset (30+ million rows), so I'm hoping there can be a solution using built in, optimised functions, but I can brute force it with .apply if that's the only option.
Example:
User | Value Col | Running total |
---|---|---|
One | 2 | 0 |
One | 0 | 1 |
One | 0 | 2 |
One | 0 | 3 |
One | 1 | 0 |
One | 3 | 0 |
One | 0 | 1 |
One | 0 | 2 |
Two | 0 | 1 |
Two | 0 | 2 |
Two | 0 | 3 |
Two | 3 | 0 |
Two | 0 | 1 |
Two | 0 | 2 |
Solution
There's a common trick for it in pandas: use cumsum
on a boolean mask to create groups of consecutive rows. Then use a groupby
+ cumcount
to label the values inside each group!
import pandas as pd
df = pd.read_clipboard() # Your df here
groups = df["Value Col"].ne(0).cumsum()
df["Running total"] = df.groupby(["User", groups]).cumcount()
df
:
User Value Col Running total
0 One 2 0
1 One 0 1
2 One 0 2
3 One 0 3
4 One 1 0
5 One 3 0
6 One 0 1
7 One 0 2
8 Two 0 0 # <-- The count resets to 0 here,
9 Two 0 1 # <-- in your example we have 1, 2, 3 instead,
10 Two 0 2 # <-- is that a mistake? or intentional?
11 Two 3 0
12 Two 0 1
13 Two 0 2
Answered By - Chrysophylaxs
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.