Issue
I have a dataset of daily transactions with multiple records per day. I need to split it into cross-validation folds to train an ML model, but I can't use TimeSeriesSplit from sklearn directly because there are multiple transactions per day. How can I do this in Python?
Solution
Input data:
import numpy as np
import pandas as pd
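data = [
    ['DAY_1', 'afds', 5],
    ['DAY_1', 'rtws', 4],
    ['DAY_1', 'gtssd', 2],
    ['DAY_2', 'ititl', 4],
    ['DAY_2', 'uius', 7],
    ['DAY_3', 'hyaah', 6],
    ['DAY_4', 'apsaj', 9],
]
# Build the frame from a list of rows so PRICE stays numeric
# (wrapping the rows in np.array would coerce every value to a string).
df = pd.DataFrame(data, columns=['DATEDAY', 'TRANSACTION_ID', 'PRICE'])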
Resulting df (shown with DATEDAY as the index, which the solution below sets):
TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
DAY_3 hyaah 6
DAY_4 apsaj 9
Solution:
from sklearn.model_selection import TimeSeriesSplit
# Index the rows by day so all transactions of a day can be selected by label.
df = df.set_index('DATEDAY')
# Split on the unique days, not on the individual rows.
days = np.sort(df.index.unique())
tscv = TimeSeriesSplit(2)
for train_index, test_index in tscv.split(days):
    print('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    # Pull every transaction that belongs to the selected days.
    X_train, X_test = df.loc[train_days], df.loc[test_days]
    print('train:', X_train, '\n')
    print('test:', X_test, '\n')
Output:
------------------------------
train: TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
test: TRANSACTION_ID PRICE
DATEDAY
DAY_3 hyaah 6
------------------------------
train: TRANSACTION_ID PRICE
DATEDAY
DAY_1 afds 5
DAY_1 rtws 4
DAY_1 gtssd 2
DAY_2 ititl 4
DAY_2 uius 7
DAY_3 hyaah 6
test: TRANSACTION_ID PRICE
DATEDAY
DAY_4 apsaj 9
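If row positions are more convenient than sub-dataframes (for instance, to hand the folds to a scikit-learn utility through its cv argument, which accepts an iterable of (train, test) index arrays), the same idea can be sketched as a generator. The helper name day_level_splits is hypothetical and not part of scikit-learn, and the sketch assumes the day labels are already sortable (integers here):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def day_level_splits(day_labels, n_splits=2):
    """Yield (train_rows, test_rows) positional index arrays, split by day.

    Hypothetical helper for illustration; not a scikit-learn API.
    """
    day_labels = np.asarray(day_labels)
    days = np.unique(day_labels)  # np.unique returns the sorted unique days
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, test_idx in tscv.split(days):
        # Map the selected days back to the positions of their rows.
        train_rows = np.flatnonzero(np.isin(day_labels, days[train_idx]))
        test_rows = np.flatnonzero(np.isin(day_labels, days[test_idx]))
        yield train_rows, test_rows


# Same layout as the example data: 3 rows on day 1, 2 on day 2, 1 each on days 3 and 4.
day_labels = pd.Series([1, 1, 1, 2, 2, 3, 4])
splits = list(day_level_splits(day_labels, n_splits=2))
for train_rows, test_rows in splits:
    print('train rows:', train_rows, 'test rows:', test_rows)
```

Because the split is made on days, all transactions of a given day always end up on the same side of a fold.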
Note 1: we assume that the date column can be sorted. In this example, DAY_X doesn't sort well because the labels are compared as strings, so DAY_11 would be placed before DAY_2, for instance. If we only know the number X of the day, then we need to put X in the column instead of DAY_X, e.g., we might do something like this (before calling set_index):
df['DATEDAY'] = [int(x.split('_')[1]) for x in df['DATEDAY']]
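A quick check of the ordering problem, using three made-up labels (they are illustrative and not taken from the dataset above):

```python
labels = ['DAY_1', 'DAY_2', 'DAY_11']

# String comparison is character by character, so 'DAY_11' sorts before 'DAY_2'.
sorted_labels = sorted(labels)
print(sorted_labels)   # ['DAY_1', 'DAY_11', 'DAY_2']

# Extracting the integer part restores the chronological order.
numbers = [int(x.split('_')[1]) for x in labels]
sorted_numbers = sorted(numbers)
print(sorted_numbers)  # [1, 2, 11]
```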
Note 2: if we want to avoid having DATEDAY as the index of the dataframe, we can simply reset the index for X_train and X_test:
for train_index, test_index in tscv.split(days):
    print('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    X_train, X_test = df.loc[train_days].reset_index(), df.loc[test_days].reset_index()
    print('train:\n', X_train, '\n')
    print('test:\n', X_test, '\n')
Output:
------------------------------
train:
DATEDAY TRANSACTION_ID PRICE
0 DAY_1 afds 5
1 DAY_1 rtws 4
2 DAY_1 gtssd 2
3 DAY_2 ititl 4
4 DAY_2 uius 7
test:
DATEDAY TRANSACTION_ID PRICE
0 DAY_3 hyaah 6
------------------------------
train:
DATEDAY TRANSACTION_ID PRICE
0 DAY_1 afds 5
1 DAY_1 rtws 4
2 DAY_1 gtssd 2
3 DAY_2 ititl 4
4 DAY_2 uius 7
5 DAY_3 hyaah 6
test:
DATEDAY TRANSACTION_ID PRICE
0 DAY_4 apsaj 9
Answered By - Jau A