Tuesday, June 28, 2022

[FIXED] cumsum based on taking first value from another column the creating a new calculation

June 28, 2022 numpy, pandas, python No comments

Issue

I was hoping to get some help with a calculation I'm struggling a bit with. I'm working with some data (copied below) and I need create a calculation that takes the first value > 0 from another column and computes a new series based on that value, and then aggregates the numbers giving a cumulative sum. My raw data looks like this:

d = {'Final Account': ['A', 'A', 'A' ,'A' , 'A', 'A', 'A','A' ,'A' ,'A', 'A', 'A', 'A'], 
     'Date': ['Jun-21','Jul-21','Aug-21','Sep-21','Oct-21','Nov-21','Dec-21','Jan-22','Feb-22','Mar-22','Apr-22','May-22','Jun-22'],
     'Units':[0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0]}
df = pd.DataFrame(data=d)

Account Date    Units
A       Jun-21  0
A       Jul-21  0
A       Aug-21  0
A       Sep-21  0
A       Oct-21  10
A       Nov-21  0
A       Dec-21  20
A       Jan-22  0
A       Feb-22  0
A       Mar-22  7
A       Apr-22  12
A       May-22  35
A       Jun-22  0

To the table I do an initial conversion for my data which is:

df['Conv'] = df['Units'].apply(x/5)

This adds a new column to my table like this:

   Account  Date    Units   Conv
   A        Jun-21  0       0
   A        Jul-21  0       0
   A        Aug-21  0       0
   A        Sep-21  0       0
   A        Oct-21  10      2
   A        Nov-21  0       0 
   A        Dec-21  20      4 
   A        Jan-22  0       0
   A        Feb-22  0       0
   A        Mar-22  7       1
   A        Apr-22  12      2
   A        May-22  35      7
   A        Jun-22  0       0

The steps after this I begin to run into issues. I need to calculate new field which takes the first value of the conv field > 0, at the same index position and begins a new calculation based on the previous rows cumsum and then adds it back into the cumsum following the calculation. Outside of python this is done by creating two columns. One to calculate new units by:

(Units - (previous row cumsum of existing units * 2))/5

Then existing units which is just the cumsum of the values that have been figured out to be new units. The desired output should look something like this:

   Account  Date    Units   Conv  New Units  Existing Units (cumsum of new units)
   A        Jun-21  0       0     0          0
   A        Jul-21  0       0     0          0
   A        Aug-21  0       0     0          0
   A        Sep-21  0       0     0          0 
   A        Oct-21  10      2     2          2 
   A        Nov-21  0       0     0          2
   A        Dec-21  20      4     3          5
   A        Jan-22  0       0     0          5
   A        Feb-22  0       0     0          5
   A        Mar-22  7       1     0          5
   A        Apr-22  12      2     0          5
   A        May-22  35      7     5          10
   A        Jun-22  0       0     0          10

The main issue I'm struggling with is grabbing the first value >0 from the "Conv" column and being able to create a new cumsum based on that initial value that can be applied to the "New Units" calculation. Any guidance is much appreciated, and despite reading a lot around I've hit a bit of a brick wall! If you need me to explain better please do ask! :)

Much appreciated in advance!

Solution

I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit

does result in

   Final Account    Date  Units  New Units
0              A  Jun-21      0          0
1              A  Jul-21      0          0
2              A  Aug-21      0          0
3              A  Sep-21      0          0
4              A  Oct-21     10          2
5              A  Nov-21      0          0
6              A  Dec-21     20          3
7              A  Jan-22      0          0
8              A  Feb-22      0          0
9              A  Mar-22      7          0
10             A  Apr-22     12          0
11             A  May-22     35          5
12             A  Jun-22      0          0

The first step identifies the blocks in column Units whose last item is relevant for building the new units: Successive zeros, followed by non-zeros, until the first zero. This

groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()

results in

Then group column Units along these blocks, grab the last item if each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula) and store it in the new column New Units.

(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)

If there are multiple accounts (indicated in the comments), then one way to apply the procedure to each account separately would be to pack it into a function (here new_units), .groupby() over the Final Account column, and .apply() the function to the groups:

def new_units(sdf):
    groups = (sdf['Units'].eq(0) & sdf['Units'].shift().ne(0)).cumsum()
    last = 0
    for _, group in sdf['Units'].groupby(groups):
        i, unit = group.index[-1], group.iloc[-1]
        if unit != 0:
            new_unit = (unit - last * 2) // 5
            last = sdf.at[i, 'New Units'] = new_unit
    return sdf

df['New Units'] = 0
df = df.groupby('Final Account').apply(new_units)

Answered By - Timus

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, June 28, 2022

[FIXED] cumsum based on taking first value from another column the creating a new calculation

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels