Issue
Given a dataframe as follows:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates, columns=['A_values', 'B_values', 'C_values', 'target'])
Out:
A_values B_values C_values target
2013-02-26 6.059783 7.333694 1.389472 3.126731
2013-02-27 9.972433 1.281624 1.789931 7.529254
2013-02-28 6.621605 7.843101 0.968944 0.585713
2013-03-01 9.623960 6.165574 0.866300 5.612724
2013-03-02 6.165247 9.638430 5.743043 3.711608
... ... ... ...
2013-05-22 0.589729 6.479978 3.531450 6.872059
2013-05-23 6.279065 3.837670 8.853146 8.209883
2013-05-24 5.533017 5.241127 1.388056 5.355926
2013-05-25 1.596038 4.665995 2.406251 1.971875
2013-05-26 3.269001 1.787529 6.659690 7.545569
I need to resample by month and then calculate mape
between A_values
, B_values
, C_values
and target
column.
My final objective is to find two columns from A_values
, B_values
, C_values
which have smaller mape
values.
The expected result could be a new dataframe from a updated original dataframe by keeping values which have smaller mape
values but resetting values of other columns with NaN
s.
# function to calculate mape
def mape(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / np.maximum(np.ones(len(y_true)), np.abs(y_true))))*100
# Pseudocode to calculate mape between value columns and target column
A_error = df.resample('M').apply(mape(df['A_values'], df['target']))
B_error = df.resample('M').apply(mape(df['B_values'], df['target']))
C_error = df.resample('M').apply(mape(df['C_values'], df['target']))
At this step, we compare errors between A_values
, B_values
and C_values
for each year-month, ie., for Feb 2013, we find A_values
and B_values
column has smaller mape
values with target
column. So I will set the rest column C_values
's value by NaN
s.
The final expected result could be like this (not exact result, only show the format):
A_values B_values C_values target
2013-02-26 6.059783 7.333694 NaN 3.126731
2013-02-27 9.972433 1.281624 NaN 7.529254
2013-02-28 6.621605 7.843101 NaN 0.585713
2013-03-01 9.623960 NaN 0.866300 5.612724
2013-03-02 6.165247 NaN 5.743043 3.711608
... ... ... ...
2013-05-22 NaN 6.479978 3.531450 6.872059
2013-05-23 NaN 3.837670 8.853146 8.209883
2013-05-24 NaN 5.241127 1.388056 5.355926
2013-05-25 NaN 4.665995 2.406251 1.971875
2013-05-26 NaN 1.787529 6.659690 7.545569
How could I achieve that in Pandas? Thanks for your help at advance.
EDIT: if we implement for each whole column, with code below, we find C_values
and A_values
have smaller mape
, so I will replace values of B_values
with NaN
s.
mape(df['A_values'], df['target'])
Out[65]: 1.1619069176493515
mape(df['B_values'], df['target'])
Out[66]: 1.3477232830950627
mape(df['C_values'], df['target'])
Out[67]: 1.045038328819569
But since I hope to groupby year-month
(or resample by month) and then comprare them, thus I raise a question for a help.
Reference link:
Pandas Resample Apply Custom Function?
Solution
You can rewrite your mape
function so it works on pandas dataframe, then groupby().apply
, and finally merge_asof
to merge back:
# all your models
models = df.columns[:-1]
# models = df.columns[df.columns.str.endswith('_values')]
# function to calculate mape
def mape(y_true, y_pred):
y_pred = np.array(y_pred)
return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
axis=0)*100
errors = (df.groupby(pd.Grouper(freq='M'))
.apply(lambda x: mape(x[models], x[['target']]))
)
aligned_mape = pd.merge_asof(df[['target']], errors,
left_index=True,
right_index=True,
direction='forward'
)
mask = (pd.get_dummies(aligned_mape[models].idxmax(axis=1))
.reindex(columns=models, fill_value=0).astype(bool)
)
df[models] = df[models].mask(mask)
Output:
A_values B_values C_values target
2013-02-26 6.059783 NaN 1.389472 3.126731
2013-02-27 9.972433 NaN 1.789931 7.529254
2013-02-28 6.621605 NaN 0.968944 0.585713
2013-03-01 9.623960 NaN 0.866300 5.612724
2013-03-02 6.165247 NaN 5.743043 3.711608
... ... ... ... ...
2013-05-22 NaN 6.479978 3.531450 6.872059
2013-05-23 NaN 3.837670 8.853146 8.209883
2013-05-24 NaN 5.241127 1.388056 5.355926
2013-05-25 NaN 4.665995 2.406251 1.971875
2013-05-26 NaN 1.787529 6.659690 7.545569
Answered By - Quang Hoang
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.