Issue
Imagine we have multiple time-series observations for multiple entities, and we want to perform hyper-parameter tuning on a single model, splitting the data in a time-series cross-validation fashion.
To my knowledge, there isn't a straightforward way to do this kind of hyper-parameter tuning within the scikit-learn framework. TimeSeriesSplit handles a single time series, but it splits by row position, so it doesn't work when the rows of several entities are stacked together.
As a simple example, imagine we have a dataframe:
from itertools import product

import numpy as np
import pandas as pd

# create a dataframe with one row per (country, period) pair
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)  # random values, so yours will differ
# this produces the following dataframe:
country,period,target,a_feature
ESP,0,1,0.08
ESP,1,1,-2.0
ESP,2,1,0.1
ESP,3,1,-0.59
ESP,4,1,-0.83
ESP,5,1,0.05
ESP,6,1,0.05
ESP,7,1,0.42
ESP,8,1,0.04
ESP,9,1,2.17
FRA,0,0,-0.44
FRA,1,0,-0.48
FRA,2,0,0.82
FRA,3,0,-1.64
FRA,4,0,0.19
FRA,5,0,0.6
FRA,6,0,-0.73
FRA,7,0,-0.5
FRA,8,0,1.11
FRA,9,0,-0.75
We want to train a single model across Spain and France: take all the data up to a certain period, then use that trained model to predict the next period for both Spain and France. We also want to assess which set of hyper-parameters works best.
How can we do hyper-parameter tuning with panel data in a time-series cross-validation framework?
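To see why applying TimeSeriesSplit directly to the stacked rows falls short, here is a small sketch on the same toy panel (the a_feature column is left out because it plays no role in the split). TimeSeriesSplit cuts by row position, so one of the folds trains only on Spain and tests only on France:

```python
from itertools import product

import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Same panel layout as above: 10 ESP rows followed by 10 FRA rows.
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])

# Naive approach: split the stacked rows as if they were one time series.
naive_folds = list(TimeSeriesSplit(n_splits=3).split(df))

for train_idx, test_idx in naive_folds:
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    print(
        'train:', sorted(train.country.unique()), 'periods up to', train.period.max(),
        '| test:', sorted(test.country.unique()), 'periods from', test.period.min(),
    )
```

In the second fold the test rows are earlier in time than most of the training rows, which is not the "train on the past, predict the next period" setup described above.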
Similar questions have been asked here:
- Unbalanced Panel data: How to use Time Series Splits Cross-Validation?
- Random Forest hyper parameters tuning with panel data in python
- https://stats.stackexchange.com/questions/369397/correct-cross-validation-procedure-for-single-model-applied-to-panel-data
Solution
PanelSplit
I propose PanelSplit, a custom cross-validator for panel data. It's essentially a wrapper around TimeSeriesSplit: it takes similar arguments but adds panel-data functionality.
PanelSplit works essentially as follows:
- Create train and test indices for each fold by passing the period series to TimeSeriesSplit
- For the train and test sets of each fold, substitute the indices with the corresponding period values
- For the train and test periods of each fold, select the rows of the panel data whose period is among those values and return their indices.
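The three steps above can be traced by hand on the toy panel. This is a minimal sketch that assumes the 2-country, 10-period example from the question; the variable names are illustrative only:

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])

unique_periods = pd.Series(df.period.unique())

folds = []
# Step 1: split the unique periods, not the panel rows.
for train_pos, test_pos in TimeSeriesSplit(n_splits=3).split(unique_periods):
    # Step 2: turn positions within unique_periods into period values.
    fold_train_periods = unique_periods.iloc[train_pos].values
    fold_test_periods = unique_periods.iloc[test_pos].values
    # Step 3: find the panel rows (by position) that fall in those periods.
    train_rows = np.flatnonzero(df.period.isin(fold_train_periods).to_numpy())
    test_rows = np.flatnonzero(df.period.isin(fold_test_periods).to_numpy())
    folds.append((train_rows, test_rows))
    print('train periods', fold_train_periods.tolist(),
          '-> test periods', fold_test_periods.tolist())
```

Every fold now contains rows from both countries, and each test set covers periods that come strictly after the training periods.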
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

class PanelSplit:
    def __init__(self, unique_periods, train_periods, n_splits=5, gap=0, test_size=None, max_train_size=None):
        """
        A class for performing time series cross-validation with custom train/test splits based on unique periods.

        Parameters:
        - unique_periods: Pandas DataFrame or Series containing unique periods, in chronological order
        - train_periods: Pandas Series holding the period of every row in the training data
        - n_splits: Number of splits for TimeSeriesSplit
        - gap: Gap between train and test sets in TimeSeriesSplit
        - test_size: Size of the test set in TimeSeriesSplit
        - max_train_size: Maximum size for a single training set.
        """
        self.tss = TimeSeriesSplit(n_splits=n_splits, gap=gap, test_size=test_size, max_train_size=max_train_size)
        indices = self.tss.split(unique_periods)
        self.u_periods_cv = self.split_unique_periods(indices, unique_periods)
        self.all_periods = train_periods
        self.n_splits = n_splits

    def split_unique_periods(self, indices, unique_periods):
        """
        Split unique periods into train/test sets based on TimeSeriesSplit indices.

        Parameters:
        - indices: TimeSeriesSplit indices
        - unique_periods: Pandas DataFrame or Series containing unique periods

        Returns: List of tuples containing train and test periods
        """
        u_periods_cv = []
        for train_index, test_index in indices:
            unique_train_periods = unique_periods.iloc[train_index].values
            unique_test_periods = unique_periods.iloc[test_index].values
            u_periods_cv.append((unique_train_periods, unique_test_periods))
        return u_periods_cv

    def split(self, X=None, y=None, groups=None):
        """
        Generate train/test indices based on unique periods.
        """
        self.all_indices = []
        for train_periods, test_periods in self.u_periods_cv:
            # Use positional row indices, which is what scikit-learn expects.
            # They match the index labels only when the index is the default 0..n-1.
            train_indices = np.flatnonzero(self.all_periods.isin(train_periods).to_numpy())
            test_indices = np.flatnonzero(self.all_periods.isin(test_periods).to_numpy())
            self.all_indices.append((train_indices, test_indices))
        return self.all_indices

    def get_n_splits(self, X=None, y=None, groups=None):
        """
        Returns: Number of splits
        """
        return self.n_splits
Hyper-parameter tuning with PanelSplit
Here is a demo of how it can be used as a cross-validator for hyperparameter tuning.
Before doing hyperparameter tuning in a real setting, I reset indices and drop NaN values with respect to both feature variables and the target. This usually saves me from indexing errors.
from itertools import product

# create a dataframe
countries = ['ESP', 'FRA']
periods = list(range(10))
df = pd.DataFrame(list(product(countries, periods)), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = np.random.randn(20)

# split on the unique periods, then map each fold back to the panel rows
unique_periods = pd.Series(df.period.unique())
panel_split = PanelSplit(n_splits=3,
                         unique_periods=unique_periods,
                         train_periods=df.period)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'max_depth': [2, 3]}
param_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=panel_split)
param_search.fit(df[['a_feature']], df['target'])
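For a quick experiment a dedicated class may not be needed, since GridSearchCV's cv argument also accepts an iterable of (train, test) index arrays. The sketch below builds such a list with a hypothetical helper, panel_time_series_splits (it is not part of scikit-learn or of the class above), and then inspects the search results. n_estimators and random_state are set only to keep the example fast and repeatable.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit


def panel_time_series_splits(period, n_splits=3):
    """Hypothetical helper: (train, test) positional row indices, split by period."""
    unique_periods = np.sort(period.unique())
    splits = []
    for train_pos, test_pos in TimeSeriesSplit(n_splits=n_splits).split(unique_periods):
        train_rows = np.flatnonzero(period.isin(unique_periods[train_pos]).to_numpy())
        test_rows = np.flatnonzero(period.isin(unique_periods[test_pos]).to_numpy())
        splits.append((train_rows, test_rows))
    return splits


# Same toy panel as above, with a seeded feature so runs are repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(list(product(['ESP', 'FRA'], range(10))), columns=['country', 'period'])
df['target'] = np.concatenate((np.repeat(1, 10), np.repeat(0, 10)))
df['a_feature'] = rng.standard_normal(20)

splits = panel_time_series_splits(df['period'], n_splits=3)

param_grid = {'max_depth': [2, 3]}
model = RandomForestClassifier(n_estimators=10, random_state=0)
param_search = GridSearchCV(model, param_grid, cv=splits)
param_search.fit(df[['a_feature']], df['target'])

# Best setting and the mean test score of each candidate across the folds.
print(param_search.best_params_)
print(pd.DataFrame(param_search.cv_results_)[['param_max_depth', 'mean_test_score']])
```

The class-based version has the advantage of being reusable with other scikit-learn tools that expect a cross-validator object; the list-based version is just shorter for one-off searches.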
Answered By - Slash