Issue
Please consider this dataframe:
import pandas as pd
import numpy as np
values = [0, 22, 30, 0, 20, 22, 11, 0, 13]
index = pd.date_range(start = '2023-10-1', periods = len(values))
df = pd.DataFrame({'values':values }, index = index)
df
values
2023-10-01 0
2023-10-02 22
2023-10-03 30
2023-10-04 0
2023-10-05 20
2023-10-06 22
2023-10-07 11
2023-10-08 0
2023-10-09 13
Goal: create a new column that counts how many days have passed since the last 0 in values.
I can do this using a for loop:
zero_indices = df[df['values'] == 0].index
df['days'] = np.nan
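for i in range(len(zero_indices)-1):
    # label slices include both endpoints; the next pass resets the trailing zero row to 0
    df.loc[zero_indices[i]: zero_indices[i+1], 'days'] = np.arange(len(df[zero_indices[i]: zero_indices[i+1]]))
# rows from the last zero to the end of the frame
df.loc[zero_indices[-1]:, 'days'] = np.arange(len(df[zero_indices[-1]: ]))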
values days
2023-10-01 0 0.00
2023-10-02 22 1.00
2023-10-03 30 2.00
2023-10-04 0 0.00
2023-10-05 20 1.00
2023-10-06 22 2.00
2023-10-07 11 3.00
2023-10-08 0 0.00
2023-10-09 13 1.00
Question: How can this be done using vectorization (faster)?
Solution
There are many ways to do this; one is to use groupby and cumcount:
df['temp'] = (df['values'] == 0).cumsum() # use df['values']: df.values is the DataFrame's array attribute, not the column
df.groupby(['temp']).cumcount() # this just gives the cumulative count since the last 0 value
Output:
2023-10-01 0
2023-10-02 1
2023-10-03 2
2023-10-04 0
2023-10-05 1
2023-10-06 2
2023-10-07 3
2023-10-08 0
2023-10-09 1
Freq: D, dtype: int64
Answered By - Suraj Shourie
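For reference, here is a self-contained sketch of the same groupby/cumcount idea without the helper column. It is not part of the original answer, and the names is_zero, block and days_calendar are only illustrative. Note that cumcount counts rows, which equals days only because the index has one row per day. The second column is a possible variant that measures calendar days from the index instead, assuming a sorted DatetimeIndex.

```python
import pandas as pd

values = [0, 22, 30, 0, 20, 22, 11, 0, 13]
index = pd.date_range(start="2023-10-01", periods=len(values))
df = pd.DataFrame({"values": values}, index=index)

# True on every row where the column is 0. Bracket access avoids the
# clash with the DataFrame.values attribute.
is_zero = df["values"].eq(0)

# Each zero starts a new block, so the running total of zeros labels the blocks.
block = is_zero.cumsum()

# Row-based count: position of each row inside its block.
df["days"] = df.groupby(block).cumcount()

# Calendar-based count (assumption: the index is a sorted DatetimeIndex).
# Keep the timestamp only on zero rows, carry it forward, then subtract.
timestamps = df.index.to_series()
last_zero = timestamps.where(is_zero).ffill()
df["days_calendar"] = (timestamps - last_zero).dt.days

print(df)
```

On this data both columns agree. They would differ if some dates were missing from the index. Rows that come before the first 0 (none here) land in block 0 and are counted from the start of the frame; masking them with df["days"].where(block > 0) is one way to mark them as undefined.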