Issue
I have a pandas DataFrame with 3 columns and a datetime index. I want to find the rows whose B values are identical AND whose timestamps are within 1 hour of each other.
Explanation/example: In the example below, the B values "s_1" and "s_2" each appear in more than one row. The two "s_1" rows are 1:45 apart, so this pair is ignored. Among the "s_2" rows, one pair (Ford and GMC) is 30 minutes apart, which is within the 1-hour window, so this pair must be reported as output. Here only 2 rows qualify, but there could be more than 2 rows within a 1-hour window, and all of them must be reported. There is one other "s_2" row (Kia), but it is more than 1 hour after the most recent "s_2" row (GMC), so it is not reported.
A minimal example is below:
import pandas as pd

def main():
    df = pd.DataFrame({'dttm_utc': pd.date_range('1/1/2012', periods=10, freq=pd.offsets.Minute(n=15))})
    df['A'] = [2365, 6721, 9835, 7651, 2398, 4555, 9881, 9080, 2010, 1999]
    df['B'] = ['s_1', 's_2', 's_3', 's_2', 's_4', 's_5', 's_6', 's_1', 's_7', 's_2']
    df['C'] = ['BMW', 'Ford', 'Toyota', 'GMC', 'Hyundai', 'Chevy', 'BMW', 'Honda', 'Tesla', 'Kia']
    df.set_index('dttm_utc', inplace=True)
    print(f'Initial DF:\n{df}')

    # Find rows with identical 'B' values
    dup_df = df[df.duplicated('B', keep=False) == True]
    print(f'\nDuplicated-B DF:\n{dup_df}')

if __name__ == "__main__":
    main()
The output looks like this:
❯ python3 select_rows.py
Initial DF:
                        A    B        C
dttm_utc
2012-01-01 00:00:00  2365  s_1      BMW
2012-01-01 00:15:00  6721  s_2     Ford
2012-01-01 00:30:00  9835  s_3   Toyota
2012-01-01 00:45:00  7651  s_2      GMC
2012-01-01 01:00:00  2398  s_4  Hyundai
2012-01-01 01:15:00  4555  s_5    Chevy
2012-01-01 01:30:00  9881  s_6      BMW
2012-01-01 01:45:00  9080  s_1    Honda
2012-01-01 02:00:00  2010  s_7    Tesla
2012-01-01 02:15:00  1999  s_2      Kia

Duplicated-B DF:
                        A    B      C
dttm_utc
2012-01-01 00:00:00  2365  s_1    BMW
2012-01-01 00:15:00  6721  s_2   Ford
2012-01-01 00:45:00  7651  s_2    GMC
2012-01-01 01:45:00  9080  s_1  Honda
2012-01-01 02:15:00  1999  s_2    Kia
Desired Output:
                        A    B     C
dttm_utc
2012-01-01 00:15:00  6721  s_2  Ford
2012-01-01 00:45:00  7651  s_2   GMC
Solution
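Create a custom function that computes the time difference between consecutive rows of each group, then shift the result so that both rows of a close pair are flagged:
def within(sr, delta='1h'):
    # gap between each row and the previous row of the same group
    diff = sr.diff()
    # keep a row if it is close to the previous row or to the next one
    return (diff <= delta) | (diff.shift(-1) <= delta)
    # python >= 3.8 with walrus operator
    # return ((diff := sr.diff()) <= delta) | (diff.shift(-1) <= delta)

# lowercase '1h' is used because recent pandas versions warn about the 'H' unit
# extract values because the index is not the same after reset_index
m = (df.reset_index().groupby('B')['dttm_utc']
       .transform(within, delta='1h').values)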
Output:
>>> df[m]
                        A    B     C
dttm_utc
2012-01-01 00:15:00  6721  s_2  Ford
2012-01-01 00:45:00  7651  s_2   GMC
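For anyone who wants to run the whole thing in one go, here is a minimal end-to-end sketch built on the question's sample data. It is only an illustration under a few assumptions: the rows are already sorted by time, the threshold is passed as an explicit pd.Timedelta rather than a string, and the names flat, gaps, mask and result are my own, not part of the original answer. The intermediate gaps series shows the per-group time differences that the mask is built from.

```python
import pandas as pd

def within(sr, delta):
    """Flag timestamps within `delta` of the previous or next timestamp in `sr`."""
    diff = sr.diff()  # gap to the previous row of the same group (NaT for the first row)
    return (diff <= delta) | (diff.shift(-1) <= delta)

# Sample data from the question
df = pd.DataFrame({'dttm_utc': pd.date_range('1/1/2012', periods=10,
                                             freq=pd.offsets.Minute(n=15))})
df['A'] = [2365, 6721, 9835, 7651, 2398, 4555, 9881, 9080, 2010, 1999]
df['B'] = ['s_1', 's_2', 's_3', 's_2', 's_4', 's_5', 's_6', 's_1', 's_7', 's_2']
df['C'] = ['BMW', 'Ford', 'Toyota', 'GMC', 'Hyundai', 'Chevy', 'BMW', 'Honda', 'Tesla', 'Kia']
df = df.set_index('dttm_utc')

one_hour = pd.Timedelta(hours=1)  # assumed threshold, stated explicitly

# Move the timestamps into a regular column so they can be grouped by 'B'
flat = df.reset_index()

# Gap between each row and the previous row with the same 'B' value
gaps = flat.groupby('B')['dttm_utc'].diff()

# Boolean mask as a plain array, because `flat` and `df` have different indexes
mask = (flat.groupby('B')['dttm_utc']
            .transform(within, delta=one_hour)
            .to_numpy(dtype=bool))

result = df[mask]
print(result)  # keeps the Ford (00:15) and GMC (00:45) rows
```

Because each row is compared with both its previous and its next neighbour in the group, a run of three or more rows that are each within an hour of a neighbour is kept in full, which matches the "all of them must be reported" requirement. Note that comparisons against NaT evaluate to False, so values of B that occur only once are dropped automatically.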
Answered By - Corralien