Friday, November 24, 2023

[FIXED] A recursive sample splitting scheme (with grid searching)

Tags: machine-learning, pandas, python, scikit-learn

Issue

I have a panel dataset with several samples in each cross-section (one cross-section per date), for example:

import pandas as pd
import numpy as np

dates = ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01"] * 2
dates.sort()
samples = [1, 2] * 5
df = pd.DataFrame(
    {
        "dates": dates,
        "samples": samples
    }
)

I want to create a cross-validation generator that runs three validation rounds:

In the first round, samples dated ["2018-01-01", "2019-01-01"] are the training samples and samples dated ["2020-01-01"] are the validation samples;
In the second round, samples dated ["2018-01-01", "2019-01-01", "2020-01-01"] are the training samples and samples dated ["2021-01-01"] are the validation samples;
In the last round, samples dated ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01"] are the training samples and samples dated ["2022-01-01"] are the validation samples.

In short, the training set grows by one date in each round while the validation set keeps a constant length.

I considered the PredefinedSplit() class from sklearn.model_selection, but there are two problems:

As you can see, not every sample is used (in either the training set or the validation set) in each round;
Samples dated ["2020-01-01"] are validation samples in the first round but training samples in the second and third rounds.

As far as I can tell, this rules out PredefinedSplit().
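To make the limitation concrete, here is a minimal sketch of what PredefinedSplit produces for the ten rows above. The test_fold labelling is an assumption chosen to mimic the desired validation sets (-1 marks rows that are never validated). Each fold trains on every row outside its own validation fold, so later dates end up in the earlier training sets.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# One entry per row of df: -1 = never used for validation,
# 0/1/2 = validation fold for 2020, 2021 and 2022 respectively.
test_fold = np.array([-1, -1, -1, -1, 0, 0, 1, 1, 2, 2])

ps = PredefinedSplit(test_fold)
predefined_splits = [(train.tolist(), test.tolist()) for train, test in ps.split()]

for train, test in predefined_splits:
    print("train:", train, "validation:", test)
```

In the first fold the training indices include rows 6 to 9 (the 2021 and 2022 samples), which is exactly what the expanding-window scheme is meant to avoid.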

My question is: how can I customise this splitting scheme? Ideally it should stay within sklearn, because I want to pass it to GridSearchCV() for grid searching.

Solution

This is essentially the purpose of TimeSeriesSplit from sklearn.model_selection. Note that it splits by row position, so the rows must be sorted by date.
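As a rough sketch of how TimeSeriesSplit could reproduce the three rounds from the question: this assumes the rows are sorted by date, that every date has exactly two samples, and a scikit-learn version that supports the test_size parameter (0.24 or later). The placeholder feature matrix X is only there to give the splitter a row count.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder features: ten rows, sorted by date, two rows per date.
n_rows = 10
X = np.zeros((n_rows, 1))

# test_size=2 keeps each validation fold to a single date (two rows);
# n_splits=3 gives the three rounds described in the question.
tscv = TimeSeriesSplit(n_splits=3, test_size=2)
ts_splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in ts_splits:
    print("train:", train, "validation:", test)
```

The tscv object can be passed directly as cv=tscv to GridSearchCV. If the number of samples per date varies, a fixed test_size no longer lines up with the date boundaries, and explicit index lists are the safer choice.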

But if you want more control, the cv parameter of grid search and friends accepts

An iterable yielding (train, test) splits as arrays of indices.

so a list of pairs of lists of integers should work, e.g.

cv = [
    ([0, 1, 2, 3], [4, 5]),
    ([0, 1, 2, 3, 4, 5], [6, 7]),
    ...
]

(which you should be able to clean up with generator expressions and range, for larger examples).
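One possible way to build such a list programmatically is sketched below. The helper expanding_window_splits and its min_train_periods parameter are hypothetical names, not part of scikit-learn; the sketch assumes the date labels sort chronologically (ISO-formatted strings do) and that row positions match the data later passed to fit.

```python
def expanding_window_splits(dates, min_train_periods=2):
    """Yield (train, validation) lists of row positions.

    Hypothetical helper: each round validates on one date and trains on
    every earlier date. `dates` holds one sortable label per row.
    """
    dates = list(dates)
    periods = sorted(set(dates))
    for k in range(min_train_periods, len(periods)):
        earlier = set(periods[:k])
        train = [i for i, d in enumerate(dates) if d in earlier]
        valid = [i for i, d in enumerate(dates) if d == periods[k]]
        yield train, valid


# Same layout as the question's df["dates"] column.
dates = sorted(
    ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01"] * 2
)
cv = list(expanding_window_splits(dates))

for train, valid in cv:
    print("train:", train, "validation:", valid)
```

The resulting cv list can then be supplied as the cv argument of GridSearchCV. Because the helper matches rows by date label rather than by a fixed block size, it should also cope with dates that have differing numbers of samples.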

Answered By - Ben Reiniger

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
