Issue
I have
- a list of dates (size=7713):
[date(1994,5,25), ..., date(2023,12,19)]
- a list of times (basically 5 minute interval of a day, size=288):
[time(0, 0), ..., time(23, 55)]
I want to build a DatetimeIndex, where its full "product" combine, i.e., combine all time with each date. In the end, the size of the index is 7713 * 288.
My current solution is
datetimes = [pd.Timestamp.combine(d, t) for d in dates for t in times]
index = pd.DatetimeIndex(datetimes)
However it is super slow
- combining takes around 9 seconds
- building index takes around 9 seconds again
I tried directly give datetimes to pd.DataFrame(index=datetimes, ....)
, it takes 6 seconds (better than 9 seconds if I build index beforehand).
Do we have much faster way for this?
Solution
Your code is slow because it runs pd.Timestamp.combine
in a double for loop.
Instead, use pd.merge
with type cross
to build the "product" first. Then do either datetime.combine
or a string summation as vector operations to create a joint set. Is either one of these two faster?
from datetime import date, time, datetime
import pandas as pd
a_df = pd.DataFrame([date(1994,5,25), date(1999,6,14), date(2005,12,8)])
b_df = pd.DataFrame([time(0, 0), time(16, 24), time(23, 55)])
# step 1: build product with "merge"
tmp = pd.merge(a_df, b_df, how='cross')
# step 2a: collapse both cols with "datetime.combine"
my_index_1 = tmp.apply(lambda r : datetime.combine(r['0_x'],r['0_y']),1)
# step 2b: alternatively, collapse both cols via string conversion
my_index_2 = pd.to_datetime(tmp['0_x'].astype(str) + ' ' + tmp['0_y'].astype(str))
This gives you the desired result:
print(my_index)
0 1994-05-25 00:00:00
1 1994-05-25 16:24:00
2 1994-05-25 23:55:00
3 1999-06-14 00:00:00
4 1999-06-14 16:24:00
5 1999-06-14 23:55:00
6 2005-12-08 00:00:00
7 2005-12-08 16:24:00
8 2005-12-08 23:55:00
dtype: datetime64[ns]
Or, keep the date and time separate as a multiindex (depending on what analysis is coming down the road, maybe pivoting directly on time of day is good enough). Should spare you the effort/time needed to combine
:
tmp.set_index(['0_x', '0_y']).index
MultiIndex([(1994-05-25, 00:00:00),
(1994-05-25, 16:24:00),
(1994-05-25, 23:55:00),
(1999-06-14, 00:00:00),
(1999-06-14, 16:24:00),
(1999-06-14, 23:55:00),
(2005-12-08, 00:00:00),
(2005-12-08, 16:24:00),
(2005-12-08, 23:55:00)],
names=['0_x', '0_y'])
Answered By - KingOtto
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.