Issue
I am trying to calculate statistics over a shifted/offset rolling window of an inconsistent datetimeindex of a dataset in a pandas dataframe. I want to bring these statistics back to the current datetimeindex. I have a solution but it is computationally inefficient and impractical to run over my large dataset of millions of rows.
Here is a sample of what I want and my method to achieve it.
df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
'2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))
df:
Index Col1
2022-05-25 00:20:00.930 10
2022-05-25 00:20:01.257 15
2022-05-25 00:20:01.673 20
2022-05-25 00:20:03.125 25
2022-05-25 00:20:04.190 30
2022-05-25 00:20:04.555 35
2022-05-25 00:20:04.923 40
2022-05-25 00:20:05.773 45
2022-05-25 00:20:05.989 50
2022-05-25 00:20:06.224 55
With the above dataset, this is my method to get a shifted rolling window at each index.
df['Col1 Avg'] = 0.0
for row in df.index:
offset_t = datetime.timedelta(seconds=1.5)
window_t = datetime.timedelta(seconds=1)
beg = row-offset_t-window_t
end = row-offset_t+window_t
df['Col1 Avg'].loc[row:row] = df['Col1'].loc[beg:end].mean()
df:
Index Col1 Col1 Avg
2022-05-25 00:20:00.930 10 NaN
2022-05-25 00:20:01.257 15 NaN
2022-05-25 00:20:01.673 20 10.0
2022-05-25 00:20:03.125 25 15.0
2022-05-25 00:20:04.190 30 25.0
2022-05-25 00:20:04.555 35 25.0
2022-05-25 00:20:04.923 40 27.5
2022-05-25 00:20:05.773 45 35.0
2022-05-25 00:20:05.989 50 35.0
2022-05-25 00:20:06.224 55 35.0
Is there a way to do this more efficiently? This takes ~5 minutes for just 10,0000 rows whereas a standard rolling window is <0.05 seconds.
Something like this seems like it should work but doesn't (I think) because of the inconsistent datetimeindex entries.
df['shifted avg'] = df['Col1'].shift(-1,freq=offset_t).rolling('2s').mean()
df:
Index Col1 Col1 Avg shifted avg
2022-05-25 00:20:00.930 10 NaN NaN
2022-05-25 00:20:01.257 15 NaN NaN
2022-05-25 00:20:01.673 20 10.0 NaN
2022-05-25 00:20:03.125 25 15.0 NaN
2022-05-25 00:20:04.190 30 25.0 NaN
2022-05-25 00:20:04.555 35 25.0 NaN
2022-05-25 00:20:04.923 40 27.5 NaN
2022-05-25 00:20:05.773 45 35.0 NaN
2022-05-25 00:20:05.989 50 35.0 NaN
2022-05-25 00:20:06.224 55 35.0 NaN
Solution
If you resample to 1ms, you can then take a 2 second rolling and offset by 500ms. Then since you have a record for every ms, and that's the resolution of your original index, you can merge them together to get the correct answers.
import pandas as pd
df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
'2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))
df = df.merge(df.resample('1ms')
.min()
.rolling('2S')
.mean()
.shift(500)
.rename(columns={'Col1':'Col1 Avg'}),
left_index=True,
right_index=True)
print(df)
Output
Col1 Col1 Avg
2022-05-25 00:20:00.930 10 NaN
2022-05-25 00:20:01.257 15 NaN
2022-05-25 00:20:01.673 20 10.0
2022-05-25 00:20:03.125 25 15.0
2022-05-25 00:20:04.190 30 25.0
2022-05-25 00:20:04.555 35 25.0
2022-05-25 00:20:04.923 40 27.5
2022-05-25 00:20:05.773 45 35.0
2022-05-25 00:20:05.989 50 35.0
2022-05-25 00:20:06.224 55 35.0
Answered By - Chris
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.