Tuesday, January 9, 2024

[FIXED] Pandas select rows that fall within 1 hour


Issue

I have a pandas DataFrame with 3 columns and a datetime index. I want to find the rows whose B values are identical AND whose timestamps are within 1 hour of each other.

Explanation/example: In the example below, the rows that share a B value are the "s_1" and "s_2" rows. The two "s_1" rows are 1 hour 45 minutes apart, so that pair is ignored. Among the "s_2" rows, one pair (Ford and GMC) is 30 minutes apart, which is within the 1-hour window, so this pair must be reported as output. In this example the match contains only 2 rows, but there could be more than 2 rows that fall within a 1-hour window, and all of them must be reported. There is one other "s_2" row (Kia), but it is more than 1 hour away from the most recent "s_2" row (GMC), so it is not reported.

A minimal example is below:

import pandas as pd


def main():
    df = pd.DataFrame({'dttm_utc': pd.date_range('1/1/2012', periods=10, freq=pd.offsets.Minute(n=15))})
    df['A'] = [2365, 6721, 9835, 7651, 2398, 4555, 9881, 9080, 2010, 1999]
    df['B'] = ['s_1', 's_2', 's_3', 's_2', 's_4', 's_5', 's_6', 's_1', 's_7', 's_2']
    df['C'] = ['BMW', 'Ford', 'Toyota', 'GMC', 'Hyundai', 'Chevy', 'BMW', 'Honda', 'Tesla', 'Kia']
    df.set_index('dttm_utc', inplace=True)
    print(f'Initial DF:\n{df}')

    # Find rows with identical 'B' values
    dup_df = df[df.duplicated('B', keep=False)]
    print(f'\nDuplicated-B DF:\n{dup_df}')


if __name__ == "__main__":
    main()

The output looks like this:

❯ python3 select_rows.py
Initial DF:
                        A    B        C
dttm_utc                               
2012-01-01 00:00:00  2365  s_1      BMW
2012-01-01 00:15:00  6721  s_2     Ford
2012-01-01 00:30:00  9835  s_3   Toyota
2012-01-01 00:45:00  7651  s_2      GMC
2012-01-01 01:00:00  2398  s_4  Hyundai
2012-01-01 01:15:00  4555  s_5    Chevy
2012-01-01 01:30:00  9881  s_6      BMW
2012-01-01 01:45:00  9080  s_1    Honda
2012-01-01 02:00:00  2010  s_7    Tesla
2012-01-01 02:15:00  1999  s_2      Kia

Duplicated-B DF:
                        A    B      C
dttm_utc                             
2012-01-01 00:00:00  2365  s_1    BMW
2012-01-01 00:15:00  6721  s_2   Ford
2012-01-01 00:45:00  7651  s_2    GMC
2012-01-01 01:45:00  9080  s_1  Honda
2012-01-01 02:15:00  1999  s_2    Kia

Desired Output:

                        A    B      C
dttm_utc                             
2012-01-01 00:15:00  6721  s_2   Ford
2012-01-01 00:45:00  7651  s_2    GMC

Solution

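Create a custom function that computes the time difference between consecutive rows of each B group, then shift that difference back by one row so that both rows of a close pair are flagged: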

def within(sr, delta='1h'):
    # gap between each timestamp and the previous one in the same group
    diff = sr.diff()
    # keep a row if it is close to its predecessor OR to its successor
    return (diff <= delta) | (diff.shift(-1) <= delta)
    # python >= 3.8 with walrus operator
    # return ((diff := sr.diff()) <= delta) | (diff.shift(-1) <= delta)

# extract values because the index is not the same after reset_index
m = (df.reset_index().groupby('B')['dttm_utc']
       .transform(within, delta='1h').values)

Output:

>>> df[m]
                        A    B     C
dttm_utc                            
2012-01-01 00:15:00  6721  s_2  Ford
2012-01-01 00:45:00  7651  s_2   GMC
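
For readers who want to see the intermediate gaps, here is a self-contained sketch of the same idea. It is a variant, not the code from the answer: it uses the built-in groupby `diff` instead of a custom function. The names `limit`, `flat`, `gap_prev` and `gap_next` are made up for illustration, and the sketch assumes the rows are already sorted by time, as they are in the question.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'dttm_utc': pd.date_range('1/1/2012', periods=10, freq='15min'),
    'A': [2365, 6721, 9835, 7651, 2398, 4555, 9881, 9080, 2010, 1999],
    'B': ['s_1', 's_2', 's_3', 's_2', 's_4', 's_5', 's_6', 's_1', 's_7', 's_2'],
    'C': ['BMW', 'Ford', 'Toyota', 'GMC', 'Hyundai', 'Chevy', 'BMW', 'Honda', 'Tesla', 'Kia'],
}).set_index('dttm_utc')

limit = pd.Timedelta(hours=1)  # assumed window size, taken from the question

# Work on a flat copy so the timestamps are an ordinary column.
flat = df.reset_index()
times = flat.groupby('B')['dttm_utc']

# Gap to the previous row with the same 'B' (NaT for the first row of a group).
gap_prev = times.diff()
# Gap to the next row with the same 'B' (NaT for the last row of a group).
gap_next = times.diff(-1).abs()

# A row qualifies if either neighbour in its group is at most `limit` away.
# Comparisons against NaT are False, so single-row groups never qualify.
mask = ((gap_prev <= limit) | (gap_next <= limit)).to_numpy()

result = df[mask]
print(result)
```

Because each row is compared with both neighbours in its group, a chain of three or more rows spaced less than an hour apart is kept in full. The comparison uses `<=`, so a gap of exactly one hour counts as inside the window; switch to `<` if that boundary should be excluded.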

Answered By - Corralien

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
