Issue
I am using pandas version 1.0.5.
import pandas as pd
dat1 = [
['2023-12-27','2023-12-27 00:00:00','2023-12-27 02:14:00'],
['2023-12-27','2023-12-27 03:16:00','2023-12-27 04:19:00'],
['2023-12-27','2023-12-27 18:11:00','2023-12-27 20:13:00'],
['2023-12-28','2023-12-28 01:16:00','2023-12-28 02:14:00'],
['2023-12-28','2023-12-28 02:16:00','2023-12-28 02:28:00'],
['2023-12-28','2023-12-28 02:30:00','2023-12-28 02:56:00'],
['2023-12-28','2023-12-28 18:45:00','2023-12-28 19:00:00'],
['2023-12-29','2023-12-29 01:16:00','2023-12-29 02:13:00'],
['2023-12-29','2023-12-29 04:16:00','2023-12-29 05:09:00'],
['2023-12-29','2023-12-29 05:11:00','2023-12-29 05:14:00'],
['2023-12-29','2023-12-29 18:00:00','2023-12-29 19:00:00']
]
df = pd.DataFrame(dat1,columns = ['date','Start_tmp','End_tmp'])
df["Start_tmp"] = pd.to_datetime(df["Start_tmp"])
df["End_tmp"] = pd.to_datetime(df["End_tmp"])
My dataframe looks like this:
I need to find the common (overlapping) time-of-day intervals shared by the timestamps on every date.
For example, one interval that overlaps across all three dates (highlighted in yellow) is 1:16 - 2:13. The other (highlighted in blue) is 18:45 - 19:00.
So my expected output would be like:
[57,15]
57 - Minutes between 1:16 - 2:13.
15 - Minutes between 18:45 - 19:00
Any clue how this output can be achieved? Thanks.
Solution
This solution uses:
- numpy and no uncommon Python modules, so with pandas 1.0.5 you should, hopefully, be in the clear,
- no nested loops, to avoid speed issues as the dataset grows.
Method:
- Draw the landscape of overlaps
- Then select the overlaps corresponding to the number of documented days,
- Finally describe the overlaps in terms of their lengths
Number of documented days (using the .days attribute of a timedelta, as in "Python: Convert timedelta to int in a dataframe"):
n = 1 + ( max(df['End_tmp']) - min(df['Start_tmp']) ).days
n
3
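The span-based formula above assumes the documented days are consecutive. As a hedged alternative, and assuming the date column has one label per documented day, counting distinct dates gives the number of days actually present. The toy frame below is hypothetical data with a gap, used only to show where the two counts differ:

```python
import pandas as pd

# Hypothetical toy data: the 27th and 29th are documented, the 28th is missing
toy = pd.DataFrame({
    "date": ["2023-12-27", "2023-12-27", "2023-12-29"],
    "Start_tmp": pd.to_datetime(["2023-12-27 00:00", "2023-12-27 03:00", "2023-12-29 01:00"]),
    "End_tmp": pd.to_datetime(["2023-12-27 02:00", "2023-12-27 04:00", "2023-12-29 02:00"]),
})

# Span-based count, as in the answer: counts calendar days between first start and last end
n_span = 1 + (toy["End_tmp"].max() - toy["Start_tmp"].min()).days  # 3

# Distinct-date count: counts only days that actually appear in the data
n_distinct = toy["date"].nunique()  # 2
```

On the question's data both counts are 3, so the rest of the answer is unaffected.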
Additive landscape:
import numpy as np

# initial flat whole-day landscape (height: 0), one slot per minute of the day
L = np.zeros(24*60, dtype='int')
# add up ranges: (reused @sammywemmy's perfect formula for time of day in minutes)
starts = df['Start_tmp'].dt.hour.mul(60) + df['Start_tmp'].dt.minute  # Start_tmp timestamps expressed in minutes
ends = df['End_tmp'].dt.hour.mul(60) + df['End_tmp'].dt.minute  # End_tmp timestamps expressed in minutes
for start, end in zip(starts, ends):
    L[start:end+1] += 1  # the end minute is included in the range
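import matplotlib.pyplot as plt

plt.plot(L)
# dashed guides at heights 2 and 3, drawn across the whole day (24*60 minutes)
plt.hlines(y=[2,3], xmin=0, xmax=24*60, colors=['green','red'], linestyles='dashed')
plt.xlabel('time of day (minutes)')
plt.ylabel('time range overlaps')
plt.show()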
(Please excuse the typo in the original figure: the x-axis is obviously in minutes, not seconds.)
Keep only overlaps over all days:
# Reduce heights <n to 0 because they do not overlap every day
L[L<n]=0
# Simplify all greater values to 1 because only their presence matters
L[L>0]=1
# Now only overlaps are highlighted
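One caveat worth checking: thresholding the summed landscape at n equates "n intervals cover this minute" with "every day covers this minute". That holds for the question's data, but not if two intervals on the same day overlap each other. The sketch below is a hedged variant, not the answer's method: it builds one boolean mask per date and ANDs them together. The function name common_mask and the toy frame are illustrative assumptions, and it does use a small nested loop.

```python
import numpy as np
import pandas as pd

def common_mask(df):
    """Boolean array of length 1440: True where every date covers that minute."""
    start = (df["Start_tmp"].dt.hour * 60 + df["Start_tmp"].dt.minute).to_numpy()
    end = (df["End_tmp"].dt.hour * 60 + df["End_tmp"].dt.minute).to_numpy()
    dates = df["date"].to_numpy()
    common = np.ones(24 * 60, dtype=bool)
    for d in pd.unique(dates):
        day = np.zeros(24 * 60, dtype=bool)
        for s, e in zip(start[dates == d], end[dates == d]):
            day[s:e + 1] = True  # same inclusive end as L[start:end+1]
        common &= day            # keep only minutes covered on every date so far
    return common

# Hypothetical toy data: two overlapping intervals on one day, one interval on another
toy = pd.DataFrame({
    "date": ["2023-12-27", "2023-12-27", "2023-12-28"],
    "Start_tmp": pd.to_datetime(["2023-12-27 01:00", "2023-12-27 01:30", "2023-12-28 05:00"]),
    "End_tmp": pd.to_datetime(["2023-12-27 02:00", "2023-12-27 02:30", "2023-12-28 06:00"]),
})

# Summed landscape with a threshold of 2 days flags 01:30-02:00, although only one day covers it
counts = np.zeros(24 * 60, dtype=int)
for s, e in [(60, 120), (90, 150), (300, 360)]:
    counts[s:e + 1] += 1
naive_hits = int((counts >= 2).sum())

# The per-date mask reports no minute shared by both days
mask_hits = int(common_mask(toy).sum())
```

If intervals within a day never overlap, as in the question, both approaches agree and the loop-free version above is the simpler choice.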
Extract overlap ranges and their lengths
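# Highlight edges of range overlaps
D = np.diff(L)
# Describe overlaps as ranges
# Note: D[i] = L[i+1] - L[i], so both indices below sit one minute before the
# actual first and last minute of each overlap; the lengths are unaffected.
R = list(zip([a[0] for a in np.argwhere(D>0)],     # indices where overlaps *begin*, with scalar indices instead of arrays
             [a[0]-1 for a in np.argwhere(D<0)]))  # indices where overlaps *end*, with scalar indices instead of arrays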
R
[(75, 132), (1124, 1139)]
# Finally their lengths
[b-a for a,b in R]
Final output: [57, 15]
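The np.diff step only sees an edge when the value changes between two neighbouring minutes, so an overlap that starts at minute 0 or ends at minute 1439 would go unnoticed. A minimal sketch of one way to handle this, assuming a 0/1 array like the reduced L above, is to pad the array with a zero on each side before differencing. The helper name run_bounds is hypothetical:

```python
import numpy as np

def run_bounds(mask):
    """Return (first_index, last_index) for each run of ones in a 0/1 array."""
    padded = np.concatenate(([0], np.asarray(mask, dtype=int), [0]))
    d = np.diff(padded)
    firsts = np.flatnonzero(d == 1)       # index in mask where a run begins
    lasts = np.flatnonzero(d == -1) - 1   # index in mask of the run's last element
    return list(zip(firsts.tolist(), lasts.tolist()))

# Rebuild the reduced landscape for the question's data:
# minutes 76..133 (01:16-02:13) and 1125..1140 (18:45-19:00)
reduced = np.zeros(24 * 60, dtype=int)
reduced[76:134] = 1
reduced[1125:1141] = 1

bounds = run_bounds(reduced)              # [(76, 133), (1125, 1140)]
lengths = [b - a for a, b in bounds]      # [57, 15]

# A run touching both array edges is still detected thanks to the padding
edge_bounds = run_bounds([1, 1, 0, 1, 0, 0, 1])  # [(0, 1), (3, 3), (6, 6)]
```

With the padding, the returned indices are the actual minutes of the day (76 is 01:16, 133 is 02:13), and the lengths match the final output above.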
Answered By - OCa