Issue
This is followup question of this
I have two dataframes
:
print df_1
timestamp A B
0 2016-05-15 0.020228 0.026572
1 2016-05-15 0.057780 0.175499
2 2016-05-15 0.098808 0.620986
3 2016-05-17 0.158789 1.014819
4 2016-05-17 0.038129 2.384590
5 2018-05-17 0.011111 9.999999
print df_2
start end event
0 2016-05-14 2016-05-16 E1
1 2016-05-14 2016-05-16 E2
2 2016-05-17 2016-05-18 E3
I would like to merge df_1
and df_2
and get the event column
in df_1
if the timestamp
falls in between of start
and end
.
The problems, and the differences with this question, are
1) that event
s E1
and E2
have the same start
and end
.
2) Also in df_1
the 6th row does not fall inside any of the intervals.
In the end I would like to have both of the events and for the row that does not have any event have NA
.
So I would like my resulting dataframe
to be like this
timestamp A B event
0 2016-05-15 0.020228 0.026572 E1
1 2016-05-15 0.057780 0.175499 E1
2 2016-05-15 0.098808 0.620986 E1
3 2016-05-15 0.020228 0.026572 E2
4 2016-05-15 0.057780 0.175499 E2
5 2016-05-15 0.098808 0.620986 E2
6 2016-05-17 0.158789 1.014819 E3
7 2016-05-17 0.038129 2.384590 E3
8 2018-05-17 0.011111 9.999999 NA
Solution
import pandas as pd
df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
'end':['2016-05-16','2016-05-16','2016-05-18'],
'event':['E1','E2','E3']})
df_1.timestamp = pd.to_datetime(df_1.timestamp, format='%Y-%m-%d')
df_2.start = pd.to_datetime(df_2.start, format='%Y-%m-%d')
df_2.end = pd.to_datetime(df_2.end, format='%Y-%m-%d')
# convert game_ref_dt to long format with all the dates in between, and do a left merge on date
df_2_2 = pd.melt(df_2, id_vars='event', value_name='timestamp')
df_2_2.timestamp = pd.to_datetime(df_2_2.timestamp)
df_2_2.set_index('timestamp', inplace=True)
df_2_2.drop('variable', axis=1, inplace=True)
df_2_3 = df_2_2.groupby('event').resample('D').ffill().reset_index(level=0, drop=True).reset_index()
df_2 = pd.merge(df_2, df_2_3)
df_2 = df_2.drop(columns=['start', 'end'])
df_1 = df_1.merge(df_2,on='timestamp', how='left')
print(df_1)
timestamp A event
0 2016-05-15 1 E1
1 2016-05-15 1 E2
2 2016-05-15 1 E1
3 2016-05-15 1 E2
4 2016-05-15 1 E1
5 2016-05-15 1 E2
6 2016-05-17 1 E3
7 2016-05-17 1 E3
8 2018-05-17 1 NaN
Credits to this
Also this solution, but does not give the NA
at the last row
import pandas as pd
df_1 = pd.DataFrame({'timestamp':['2016-05-15','2016-05-15','2016-05-15','2016-05-17','2016-05-17','2018-05-17'],
'A':[1,1,1,1,1,1]})
df_2 = pd.DataFrame({'start':['2016-05-14','2016-05-14','2016-05-17'],
'end':['2016-05-16','2016-05-16','2016-05-18'],
'event':['E1','E2','E3']})
df_try2 = pd.merge(df_1.assign(key=1), df_2.assign(key=1), on='key').query('timestamp >= start and timestamp <= end')
print(df_try2)
timestamp A key start end event
0 2016-05-15 1 1 2016-05-14 2016-05-16 E1
1 2016-05-15 1 1 2016-05-14 2016-05-16 E2
3 2016-05-15 1 1 2016-05-14 2016-05-16 E1
4 2016-05-15 1 1 2016-05-14 2016-05-16 E2
6 2016-05-15 1 1 2016-05-14 2016-05-16 E1
7 2016-05-15 1 1 2016-05-14 2016-05-16 E2
11 2016-05-17 1 1 2016-05-17 2016-05-18 E3
14 2016-05-17 1 1 2016-05-17 2016-05-18 E3
Answered By - quant
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.