Issue
The following data frame is used as input:
import pandas as pd
import numpy as np
json_string = '{"datetime":{"0":1528955662000,"1":1528959255000,"2":1528965487000,"3":1528966204000,"4":1528966289000,"5":1528971637000,"6":1528974438000,"7":1528975251000,"8":1528982200000,"9":1528992569000,"10":1528994282000},"hit":{"0":1,"1":0,"2":0,"3":0,"4":0,"5":1,"6":1,"7":0,"8":1,"9":0,"10":1}}'
from io import StringIO  # recent pandas versions deprecate passing a literal JSON string
df = pd.read_json(StringIO(json_string))
The exercise requires you to compute the mean of the hit column for each moment in time (datetime). However, the current observation must not be included in the mean. For instance, the first observation (index=0) gets np.nan, since there are no observations other than the one the mean is being calculated for. The second observation (index=1) gets 1, since 1/1 = 1 (the 0 from the second observation itself is not included). The third observation (index=2) gets 0.5, since (1+0)/2 = 0.5.
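To make the expected values concrete, here is a minimal plain-Python sketch of the rule described above. It assumes the hit values are already in datetime order; the function name mean_of_previous is made up for illustration and is not part of the exercise.

```python
import math

# `hit` values from the example data, in datetime order
hits = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1]


def mean_of_previous(values):
    """Mean of all earlier values for each position; NaN when there are none."""
    result = []
    running_sum = 0
    for count, value in enumerate(values):
        # `count` is the number of earlier observations seen so far
        result.append(running_sum / count if count else math.nan)
        running_sum += value  # the current value only counts for later rows
    return result


expected = mean_of_previous(hits)
print(expected[:3])  # [nan, 1.0, 0.5]
```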
My code produces the correct numbers but is not elegant. Is there a cleaner way to complete the exercise? For example, is it possible to use pandas.api.indexers.VariableOffsetWindowIndexer, or to subclass pandas.api.indexers.BaseIndexer and override its get_window_bounds() method?
My solution:
def add_hr(df):
    """
    Generate a feature `mean_hr` which represents the average hit rate
    at the moment of making the offer (`datetime`).

    Parameters
    ----------
    df : pandas.DataFrame
        The `hit` column must be present. Ascending/descending order in the `datetime`
        column is not assumed.
            hit : int
            datetime : string (format='%Y-%m-%d %H:%M:%S')

    Returns
    -------
    df_expanded : pandas.DataFrame
        A (deep) copy of the input pandas.DataFrame.
    """
    df_expanded = df.copy(deep=True)
    df_expanded.sort_values(by=['datetime'], ascending=True, inplace=True)
    df_expanded['mean_hr'] = df_expanded['hit'].expanding().mean()
    srs = df_expanded['mean_hr']
    srs = srs[:len(srs)-1]
    srs = pd.concat([pd.Series([np.nan]), srs])
    df_expanded['mean_hr'] = srs.tolist()
    return df_expanded
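For comparison, dropping the last value and prepending NaN, as the function above does, has the same effect as shifting the expanding mean down by one row. The following is only a sketch of that equivalence; it uses a stand-in Series for the hit column and assumes the rows are already sorted by datetime.

```python
import numpy as np
import pandas as pd

# Stand-in for the `hit` column, assumed to be sorted by datetime already
hit = pd.Series([1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1], name="hit")

# The expanding mean includes the current row ...
inclusive = hit.expanding().mean()

# ... so shifting it down by one row leaves row i with the mean of rows 0..i-1,
# and the first row becomes NaN.
mean_hr = inclusive.shift(1)

# The manual version: drop the last value and prepend NaN
manual = pd.Series([np.nan] + inclusive.iloc[:-1].tolist())
```

Both mean_hr and manual hold the same values, so the shift can replace the slice-and-concatenate step without changing the numbers.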
Full disclaimer: The exercise was a part of a recruitment process a month ago. The recruitment is now closed and I can't submit code anymore.
Solution
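It seems that the problem can be solved by subclassing the BaseIndexer class and overriding get_window_bounds() so that the window for row i covers rows 0 through i-1, which excludes the current observation: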
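import numpy as np
from pandas.api.indexers import BaseIndexer

class CustomIndexer(BaseIndexer):
    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # For row i the window is [0, i): every earlier row, excluding row i itself.
        end = np.arange(0, num_values, step, dtype='int64')
        start = np.zeros(len(end), dtype='int64')
        return start, end

indexer = CustomIndexer(window_size=0)
df_expanded = df.copy(deep=True)
# Note: this overwrites `hit` with the running mean; assign to a new column
# (e.g. `mean_hr`) to keep the original values.
df_expanded.hit = df_expanded.hit.rolling(indexer).mean()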
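As a quick check, the same idea can be run on a small unsorted frame built from the first four rows of the example data. This is only a sketch: the class name PreviousRowsIndexer is hypothetical, the result is written to a separate mean_hr column, and the five-argument get_window_bounds signature (including step) assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class PreviousRowsIndexer(BaseIndexer):
    """Hypothetical name: the window for row i covers rows 0..i-1."""

    def get_window_bounds(self, num_values, min_periods, center, closed, step):
        # `step` is None when rolling() is called without a step argument
        end = np.arange(0, num_values, step or 1, dtype="int64")
        start = np.zeros(len(end), dtype="int64")
        return start, end


# First four rows of the example data, deliberately out of order
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        [1528959255000, 1528955662000, 1528966204000, 1528965487000], unit="ms"
    ),
    "hit": [0, 1, 0, 0],
})

# Sort first: the window bounds are positional, so row order matters
df = df.sort_values("datetime").reset_index(drop=True)
df["mean_hr"] = df["hit"].rolling(PreviousRowsIndexer(window_size=0)).mean()
```

After sorting, hit is 1, 0, 0, 0, so mean_hr should come out as NaN, 1, 0.5 and 1/3, matching the values worked out in the question.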
Answered By - balkon16