Issue
I have a panel data set with several samples in each cross-section, for example:
import pandas as pd
import numpy as np

dates = ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01"] * 2
dates.sort()
samples = [1, 2] * 5
df = pd.DataFrame(
    {
        "dates": dates,
        "samples": samples
    }
)
I want to create a cross-validation generator that validates 3 times:
- The first time, samples in ["2018-01-01", "2019-01-01"] are the training samples, and samples in ["2020-01-01"] are the validation samples;
- The second time, samples in ["2018-01-01", "2019-01-01", "2020-01-01"] are the training samples, and samples in ["2021-01-01"] are the validation samples;
- The last time, samples in ["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01"] are the training samples, and samples in ["2022-01-01"] are the validation samples.
Briefly, the training set grows with each split while the validation set keeps a constant length.
I had thought about the PredefinedSplit() function from sklearn.model_selection, but the problems are:
- As you can see, not every sample is used (in either the training set or the validation set) in each split;
- ["2020-01-01"] is the validation set in the first split but part of the training set in the second and third.
This makes PredefinedSplit() powerless.
My question is: how can I customise this splitting scheme? Ideally it should stay within sklearn, as I want to pass the splitting scheme to GridSearchCV() for grid searching.
Solution
This is essentially the purpose of TimeSeriesSplit (docs).
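As a minimal sketch of how that could look for the 10-row frame in the question: it assumes the rows are already sorted by date and that every date has the same number of rows (two here), since TimeSeriesSplit works on row positions rather than on the date values themselves. The names n_rows and rows_per_date are illustrative, and the test_size argument needs a reasonably recent scikit-learn (0.24 or later).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 10        # number of rows in the example frame
rows_per_date = 2  # assumption: every date contributes exactly two rows

# Placeholder feature matrix; only its length matters for splitting.
X = np.zeros((n_rows, 1))

# Three expanding-window splits, each validating on one date's worth of rows.
tscv = TimeSeriesSplit(n_splits=3, test_size=rows_per_date)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    print("train:", train, "validation:", test)
```

If the number of rows per date varies, this positional approach no longer lines up with the dates, and an explicit list of index pairs is the safer route.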
But if you want more control, the cv parameter of grid search and friends accepts "An iterable yielding (train, test) splits as arrays of indices.", so a list of pairs of lists of integers should work, e.g.
cv = [
    ([0, 1, 2, 3], [4, 5]),
    ([0, 1, 2, 3, 4, 5], [6, 7]),
    ...
]
(which you should be able to clean up with generator expressions and range, for larger examples).
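One way to build such a list from the dates column, rather than typing the indices by hand, might look like the sketch below. The helper name expanding_date_splits and its n_splits parameter are hypothetical (not part of scikit-learn), and it assumes the date values sort chronologically, which holds for ISO-formatted strings like those in the question.

```python
import numpy as np

def expanding_date_splits(dates, n_splits=3):
    """Yield (train_idx, test_idx) pairs of row positions.

    Each of the last `n_splits` distinct dates serves once as the
    validation set; all strictly earlier dates form the training set.
    """
    # codes[i] is the rank of row i's date among the sorted distinct dates.
    unique_dates, codes = np.unique(np.asarray(dates), return_inverse=True)
    n_dates = len(unique_dates)
    if n_splits >= n_dates:
        raise ValueError("n_splits must be smaller than the number of distinct dates")
    for k in range(n_dates - n_splits, n_dates):
        train_idx = np.flatnonzero(codes < k)   # rows dated before date k
        test_idx = np.flatnonzero(codes == k)   # rows dated exactly date k
        yield train_idx, test_idx

# Same dates as in the question: five years, two rows per year.
dates = sorted(["2018-01-01", "2019-01-01", "2020-01-01", "2021-01-01", "2022-01-01"] * 2)
cv = list(expanding_date_splits(dates, n_splits=3))

for train_idx, test_idx in cv:
    print(train_idx.tolist(), test_idx.tolist())
```

The resulting cv list can then be passed as the cv argument of GridSearchCV. Materialising it with list() rather than handing over the generator keeps it reusable across several searches. Because the split is driven by the date values, it also works when dates have unequal numbers of rows.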
Answered By - Ben Reiniger