Friday, October 28, 2022

[FIXED] Split time series with multiple records per day

Tags: cross-validation, machine-learning, python, scikit-learn, time-series

Issue

I have a dataset of daily transactions with multiple records per day. I need to split it into cross-validation folds to train an ML model, but I can't use TimeSeriesSplit from sklearn directly on the rows, because it would split the transactions of a single day across the train and test sets. How can I do this in Python?

Solution

Input data:

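import numpy as np
import pandas as pd

# Build the frame from a list of rows so that PRICE keeps its integer dtype
# (wrapping mixed values in np.array would coerce every value to a string).
data = [
    ['DAY_1', 'afds', 5],
    ['DAY_1', 'rtws', 4],
    ['DAY_1', 'gtssd', 2],
    ['DAY_2', 'ititl', 4],
    ['DAY_2', 'uius', 7],
    ['DAY_3', 'hyaah', 6],
    ['DAY_4', 'apsaj', 9],
]
df = pd.DataFrame(data, columns=['DATEDAY', 'TRANSACTION_ID', 'PRICE'])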

Resulting df (displayed with DATEDAY as the index, which the solution code sets with set_index):

        TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7
DAY_3            hyaah     6
DAY_4            apsaj     9

Solution:

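from sklearn.model_selection import TimeSeriesSplit

# Index by day so that .loc can select every row belonging to a set of days.
df = df.set_index('DATEDAY')
# Split the unique, sorted days rather than the individual rows.
days = np.sort(df.index.unique())
tscv = TimeSeriesSplit(n_splits=2)
for train_index, test_index in tscv.split(days):
    print('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    # All transactions of a given day end up in the same fold.
    X_train, X_test = df.loc[train_days], df.loc[test_days]
    print('train:', X_train, '\n')
    print('test:', X_test, '\n')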

Output:

------------------------------
train:         TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7

test:         TRANSACTION_ID PRICE
DATEDAY
DAY_3            hyaah     6

------------------------------
train:         TRANSACTION_ID PRICE
DATEDAY
DAY_1             afds     5
DAY_1             rtws     4
DAY_1            gtssd     2
DAY_2            ititl     4
DAY_2             uius     7
DAY_3            hyaah     6

test:         TRANSACTION_ID PRICE
DATEDAY
DAY_4            apsaj     9
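If the folds are meant to feed a model rather than be printed, it may be more convenient to get positional row indices for each fold. The following is only a sketch of the same idea: the helper name day_based_splits is hypothetical (it is not part of scikit-learn), and the day labels are assumed to be plain integers that sort correctly.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def day_based_splits(day_labels, n_splits=2):
    """Yield (train_rows, test_rows) positional index arrays, split by whole days.

    Hypothetical helper, not part of scikit-learn. `day_labels` holds one
    sortable day label per row.
    """
    day_labels = np.asarray(day_labels)
    unique_days = np.unique(day_labels)  # np.unique returns sorted unique values
    tscv = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, test_idx in tscv.split(unique_days):
        # Map the selected days back to the rows that belong to them.
        train_rows = np.flatnonzero(np.isin(day_labels, unique_days[train_idx]))
        test_rows = np.flatnonzero(np.isin(day_labels, unique_days[test_idx]))
        yield train_rows, test_rows


# Same shape as the example data, with integer day numbers (an assumption).
df = pd.DataFrame({
    'DATEDAY': [1, 1, 1, 2, 2, 3, 4],
    'PRICE': [5, 4, 2, 4, 7, 6, 9],
})
splits = list(day_based_splits(df['DATEDAY'], n_splits=2))
for train_rows, test_rows in splits:
    print('train rows:', train_rows, 'test rows:', test_rows)
```

Since scikit-learn's cross-validation utilities accept an iterable of (train, test) index arrays for their cv argument, a list like splits should be usable there directly, provided the rows of X are in the same order as the day labels.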

Note 1: this assumes that the date column sorts chronologically. The DAY_X labels in this example do not: sorted as strings, DAY_11 comes before DAY_2. If only the day number X is known, store X itself in the column instead of DAY_X. For example, before calling set_index:

df['DATEDAY'] = [int(x.split('_')[1]) for x in df['DATEDAY']]
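To make the ordering problem concrete, here is a small illustration using three made-up labels (they are not taken from the dataset above). It compares the default string sort with a sort on the numeric suffix.

```python
labels = ['DAY_1', 'DAY_2', 'DAY_11']

# Default string sort compares character by character, so '1' < '2'
# puts DAY_11 ahead of DAY_2.
lexicographic = sorted(labels)

# Sorting on the integer after the underscore restores chronological order.
numeric = sorted(labels, key=lambda s: int(s.split('_')[1]))

print(lexicographic)  # ['DAY_1', 'DAY_11', 'DAY_2']
print(numeric)        # ['DAY_1', 'DAY_2', 'DAY_11']
```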

Note 2: if we want to avoid having DATEDAY as index of the dataframe, we can simply reset the index for X_train and X_test:

for train_index, test_index in tscv.split(days):
    print('------------------------------')
    train_days, test_days = days[train_index], days[test_index]
    X_train, X_test = df.loc[train_days].reset_index(), df.loc[test_days].reset_index()
    print('train:\n', X_train, '\n')
    print('test:\n', X_test, '\n')

Output:

------------------------------
train:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_1           afds     5
1   DAY_1           rtws     4
2   DAY_1          gtssd     2
3   DAY_2          ititl     4
4   DAY_2           uius     7

test:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_3          hyaah     6

------------------------------
train:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_1           afds     5
1   DAY_1           rtws     4
2   DAY_1          gtssd     2
3   DAY_2          ititl     4
4   DAY_2           uius     7
5   DAY_3          hyaah     6

test:
   DATEDAY TRANSACTION_ID PRICE
0   DAY_4          apsaj     9

Answered By - Jau A

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
