Wednesday, January 10, 2024

[FIXED] Pandas - aggregate values with a variable-length rolling window

Tags: dataframe, numpy, pandas, python, rolling-computation

Issue

The following data frame is used as input:

from io import StringIO

import pandas as pd
import numpy as np

json_string = '{"datetime":{"0":1528955662000,"1":1528959255000,"2":1528965487000,"3":1528966204000,"4":1528966289000,"5":1528971637000,"6":1528974438000,"7":1528975251000,"8":1528982200000,"9":1528992569000,"10":1528994282000},"hit":{"0":1,"1":0,"2":0,"3":0,"4":0,"5":1,"6":1,"7":0,"8":1,"9":0,"10":1}}'
# Wrap the literal in StringIO: passing a raw JSON string to read_json is deprecated since pandas 2.1.
df = pd.read_json(StringIO(json_string))

The exercise is to compute, for each moment in time (datetime), the mean of the hit column over all earlier observations; the current observation must not be included in the mean. For instance, the first observation (index=0) gets np.nan because no earlier observations exist. The second observation (index=1) gets 1, since only the first hit counts: 1/1 = 1 (its own 0 is excluded). The third observation (index=2) gets 0.5, since (1+0)/2 = 0.5.
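
To make the rule concrete, here is a plain-Python reference sketch (not part of the original question; the helper name `prior_mean` is hypothetical). It assumes the hits are already listed in datetime order, which is the case for the sample data:

```python
import math

def prior_mean(hits):
    """For each position, return the mean of all earlier values (NaN if none)."""
    out = []
    total = 0
    for i, h in enumerate(hits):
        # i earlier observations exist at position i; none at position 0.
        out.append(total / i if i > 0 else math.nan)
        total += h  # the current hit only counts for later rows
    return out

hits = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]  # the `hit` column from the sample data
expected = prior_mean(hits)
```

Any pandas solution should reproduce these values row by row.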

My code gives the correct numbers but is not elegant. Can the exercise be solved in a different way? In particular, is it possible to use pandas.api.indexers.VariableOffsetWindowIndexer, or to subclass pandas.api.indexers.BaseIndexer and implement its get_window_bounds() method?

My solution:

def add_hr(df):
    """
    Generate a feature `mean_hr` which represents the average hit rate
    at the moment of making the offer (`datetime`).

    Parameters
    ----------
    df : pandas.DataFrame
        The `hit` column must be present. Ascending/descending order in the `datetime`
        column is not assumed.

        hit : int
        datetime : string (format='%Y-%m-%d %H:%M:%S')

    Returns
    -------
    df_expanded : pandas.DataFrame
        A (deep) copy of the input pandas.DataFrame, sorted by `datetime`,
        with the `mean_hr` column added.
    """

    df_expanded = df.copy(deep=True)

    df_expanded.sort_values(by=['datetime'], ascending=True, inplace=True)

    df_expanded['mean_hr'] = df_expanded['hit'].expanding().mean()

    srs = df_expanded['mean_hr']

    srs = srs[:len(srs)-1]
    srs = pd.concat([pd.Series([np.nan]), srs])
    df_expanded['mean_hr'] = srs.tolist()

    return df_expanded

Full disclaimer: The exercise was a part of a recruitment process a month ago. The recruitment is now closed and I can't submit code anymore.

Solution

It seems that the problem can be solved by subclassing the BaseIndexer class:

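import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):

    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # Window i covers rows start[i] up to, but not including, end[i].
        # With the default step, end[i] = i, so the current row is left out.
        # step is None unless rolling(step=...) is used; np.arange treats None as 1.
        end = np.arange(0, num_values, step, dtype='int64')
        start = np.zeros(len(end), dtype='int64')

        return start, end

indexer = CustomIndexer(window_size=0)

# Sort first so that "earlier rows" means "earlier in time"; sort_values returns a copy.
df_expanded = df.sort_values(by='datetime')

# Store the result in a new column instead of overwriting `hit`.
df_expanded['mean_hr'] = df_expanded['hit'].rolling(indexer).mean()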
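
The bounds returned by get_window_bounds() can be read as row slices: window i covers rows start[i] up to but not including end[i]. The sketch below is not part of the original answer. It rebuilds the same bounds by hand for a small `hit` series (assumed to be in datetime order), evaluates each slice directly, and compares the result with a possible shorter alternative: shifting the column down one row and taking an expanding mean.

```python
import numpy as np
import pandas as pd

hit = pd.Series([1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1], name="hit")
n = len(hit)

# Same bounds the custom indexer produces when step is left at its default.
end = np.arange(0, n, dtype="int64")
start = np.zeros(n, dtype="int64")

# Row i averages hit[start[i]:end[i]], i.e. rows 0..i-1; the first slice is empty.
by_bounds = pd.Series(
    [hit.iloc[s:e].mean() for s, e in zip(start, end)], index=hit.index
)

# Alternative without a custom indexer: shift down one row, then expanding mean.
by_shift = hit.shift(1).expanding().mean()
```

Both series start with NaN and agree on every later row, so the shift-based form may be enough when no other custom windowing is needed.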

Answered By - balkon16

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
