Issue
I have two dataframes with date time information, df_stream
is a stream of events and df_events
are specific events in time. For example:
The blue is the stream and the red line is the event. The window is 30min
either side of the event and the time granularity is 30s
.
How do I produce a dataframe / graph of the average behaviour of the stream for all events for a given time window and time granularity?
The stream dataframe looks like:
streamEventId
DateTime
2020-08-20 10:39:24 44791313
2020-08-20 10:40:30 44791721
2020-08-20 10:40:54 44694121
2020-08-20 10:41:16 44902962
2020-08-20 10:42:04 44622569
The event dataframe looks like:
DateTime
0 2020-11-17 09:49:00
1 2020-11-17 10:49:00
2 2020-11-17 11:11:00
3 2020-11-17 11:16:00
4 2020-11-17 12:11:00
I've managed to get graphs of each event with their respective windows printed but am struggling to combine the logic. Also I use iterrows
which I am not a fan of.
My current approach:
for i in df_events[["DateTime"]].iterrows():
date_time = i[1].values[0]
before = date_time - pd.Timedelta(window)
after = date_time + pd.Timedelta(window)
df_stream_temp = df_stream.loc[before:after].copy()
plt.figure(figsize=(20, 2))
df_stream_mva = (
df_stream.streamEventId.groupby(pd.Grouper(freq="30s"))
.count()
.loc[before:after]
)
y_height = df_stream_mva.max()
ax = df_stream_mva.plot()
plt.vlines(df_events.DateTime.to_list(), 0, y_height, color="lightcoral")
ax.set_ylim([0, y_height])
Which gives a nice series of graphs:
And so on...
I would like to calculate and then plot the average of the above with confidence intervals.
Solution
Sat down and worked it out.
Using a similar logic to my iterrows
example above, I looped through df_events
, created before
and after
time window limits and used these to filter df_stream
grouped by a given time frequency. Then recalculated the index using time delta indexes to get a "distance from event" index. This was then appended to a list that was then combined using pd.concat
with an axis of 1. Example below:
window = "30 min"
freq = "30s"
collect_list = []
for i in df_events[["DateTime"]].iterrows():
date_time = i[1].values[0]
before = date_time - pd.Timedelta(window)
after = date_time + pd.Timedelta(window)
df_stream_window = (
df_stream.streamEventId.groupby(pd.Grouper(freq=freq))
.count()
.loc[before:after]
.reset_index()
.rename(columns={"streamEventId": "stream events"})
)
df_stream_window = df_stream_window.set_index(
pd.to_timedelta((df_stream_window.DateTime - pd.to_datetime(date_time)))
/ np.timedelta64("1", "m")
)
collect_list.append(df_stream_window)
df_collect = pd.concat(collect_list, axis=1)
This generated the below dataframe:
DateTime stream events DateTime \
DateTime
-30.0 2020-11-17 09:19:00 5 2020-11-17 10:19:00
-29.5 2020-11-17 09:19:30 5 2020-11-17 10:19:30
-29.0 2020-11-17 09:20:00 3 2020-11-17 10:20:00
-28.5 2020-11-17 09:20:30 2 2020-11-17 10:20:30
-28.0 2020-11-17 09:21:00 3 2020-11-17 10:21:00
stream events DateTime stream events \
DateTime
-30.0 3 2020-11-17 10:41:00 6
-29.5 5 2020-11-17 10:41:30 6
-29.0 3 2020-11-17 10:42:00 4
-28.5 3 2020-11-17 10:42:30 1
-28.0 1 2020-11-17 10:43:00 4
....
This can then be used in plotting with estimator="mean"
in Seaborn:
ax = sns.lineplot(data=df_collect, estimator="mean")
plt.axvline(0, color="lightcoral")
ax.set_xlabel("Time before / after event (min)")
ax.set_ylabel("# Stream Events")
Answered By - philMarius
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.