Issue
I have a dataframe with multiple numerical features. It contains data from different participants ('pp' column) of an experiment, each performing tasks under 2 different experimental conditions.
I aim to classify the working condition ('condition' column, populated with 'N' and 'S' for the 2 conditions) using the ML model with the best classification accuracy.
To do that, I need to split the data into train and test sets, stratified on 2 columns: 'pp' and 'condition', so that my training set includes part of the data of EACH participant. I then want to do cross-validation with the different (stratified) folds.
The way I am currently doing it (credit to the question "sklearn train_test_split on pandas stratify by multiple columns") is the following:
from sklearn.model_selection import train_test_split
df['stratCol'] = df['pp_id'].astype(str) + "_" + df['condition'].astype(str)
X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['condition'], test_size=0.20, stratify=df['stratCol'], shuffle=True, random_state=42)
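A quick way to check that a split like this really keeps every participant/condition pair on both sides is to count the rows per stratum in each set. The sketch below assumes a small synthetic dataframe; the data, the feature names `feat_a`/`feat_b`, and the group sizes are made up for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 participants x 2 conditions x 10 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pp_id": np.repeat([1, 2, 3, 4], 20),
    "condition": np.tile(np.repeat(["N", "S"], 10), 4),
    "feat_a": rng.normal(size=80),
    "feat_b": rng.normal(size=80),
})
feature_cols = ["feat_a", "feat_b"]

# Combined key: one stratum per (participant, condition) pair.
df["stratCol"] = df["pp_id"].astype(str) + "_" + df["condition"].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["condition"],
    test_size=0.20, stratify=df["stratCol"], shuffle=True, random_state=42,
)

# Count rows per stratum on each side of the split.
train_counts = df.loc[X_train.index, "stratCol"].value_counts().sort_index()
test_counts = df.loc[X_test.index, "stratCol"].value_counts().sort_index()
print(pd.DataFrame({"train": train_counts, "test": test_counts}))
```

With 10 rows per stratum and a 20% test size, each participant/condition pair should contribute 8 rows to the training set and 2 to the test set.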
However, the problem arises when I need to do cross-validation: I cannot find a way to split into k folds stratified by a non-target column.
I have looked into StratifiedShuffleSplit and StratifiedKFold of sklearn, but they both seem to stratify on the label column by default and do not let me define the stratification column myself. I also haven't managed to find anything similar on Stack Overflow.
So how can I do cross-validation, splitting train and test in a stratified way by participant AND condition? In other words, how do I produce the k folds in such a stratified way?
Solution
This class is a custom workaround: it provides StratifiedKFold-style splitting, but stratified on categorical column(s) in addition to y. (Let me know if you need it with numeric column(s) in addition to y.)
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils import check_random_state

class StratifiedMultiColumnKFold(BaseCrossValidator):
    def __init__(self, df, additional_col_names, y, n_splits=3, shuffle=False, random_state=None):
        self.df = df.copy()
        self.additional_col_names = additional_col_names
        self.y = y
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def _iter_test_masks(self, X=None, y=None, groups=None):
        n_samples = self.df.shape[0]
        test_folds = np.zeros(n_samples, dtype=int)
        # Concatenate the columns to form a single column for stratification
        strata = self.df[self.additional_col_names].copy()
        strata['y'] = self.y
        X_strata = strata.apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
        # Get the unique strata and their counts
        unique_strata = np.unique(X_strata)
        for stratum in unique_strata:
            # Get the indices for this stratum
            indices = np.where(X_strata == stratum)[0]
            if self.shuffle:
                check_random_state(self.random_state).shuffle(indices)
            # Distribute indices among the folds
            for fold_idx in range(self.n_splits):
                size = len(indices) // self.n_splits + (fold_idx < len(indices) % self.n_splits)
                test_folds[indices[:size]] = fold_idx
                indices = indices[size:]
        for i in range(self.n_splits):
            yield test_folds == i

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(self.df.shape[0])
        for test_mask in self._iter_test_masks(X, y, groups):
            yield indices[~test_mask], indices[test_mask]
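To make the "distribute indices among the folds" step easier to follow, here is a small sketch of the size arithmetic on its own. The helper name `fold_sizes` is hypothetical and not part of the class; it only reproduces the expression used for `size` above.

```python
def fold_sizes(n_items, n_splits):
    """Per-fold sizes for one stratum, using the same expression as the class:
    n_items // n_splits + (fold_idx < n_items % n_splits)."""
    return [n_items // n_splits + (fold_idx < n_items % n_splits)
            for fold_idx in range(n_splits)]

print(fold_sizes(10, 3))  # [4, 3, 3]
print(fold_sizes(2, 3))   # [1, 1, 0]
print(fold_sizes(1, 3))   # [1, 0, 0]

# Fold totals when five strata of 4 rows each are spread over 3 folds.
strata_sizes = [4, 4, 4, 4, 4]
totals = [sum(col) for col in zip(*(fold_sizes(s, 3) for s in strata_sizes))]
print(totals)  # [10, 5, 5]
```

Because the remainder of each stratum always goes to the lowest-numbered folds, fold 0 tends to end up larger than the others when there are many small strata. That is worth keeping in mind if the stratification columns have many distinct values.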
I tested this class with the following code:
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import pandas as pd
# Load iris dataset as an example
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# Standardize the features
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
# Select the columns to stratify by
additional_col_names = [X.columns.tolist()[0]] # use only the first feature column
# Initialize the cross-validator
cv = StratifiedMultiColumnKFold(X, additional_col_names, y, n_splits=3, shuffle=True, random_state=0)
# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'kernel': ['linear', 'rbf']}
# Initialize the GridSearchCV object
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=cv)
# Fit the GridSearchCV object
grid.fit(X, y)
# Print the best parameters
print(grid.best_params_)
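A possibly simpler alternative, which is not part of the class above: StratifiedKFold.split(X, y) only uses its second argument to build the strata, so a combined participant/condition key can be passed there while the model is still trained on 'condition'. The sketch below assumes a synthetic dataframe with made-up feature names `feat_a`/`feat_b`; the precomputed index pairs are then handed to cross_val_score through `cv`, which accepts an iterable of (train, test) splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Hypothetical toy data: 4 participants x 2 conditions x 10 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pp": np.repeat([1, 2, 3, 4], 20),
    "condition": np.tile(np.repeat(["N", "S"], 10), 4),
    "feat_a": rng.normal(size=80),
    "feat_b": rng.normal(size=80),
})
feature_cols = ["feat_a", "feat_b"]

# Combined key: one stratum per (participant, condition) pair.
strat_key = df["pp"].astype(str) + "_" + df["condition"].astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The second argument of split() defines the strata; it need not be the target.
splits = list(skf.split(df[feature_cols], strat_key))

# Each test fold should contain rows from every participant/condition pair.
for train_idx, test_idx in splits:
    print(strat_key.iloc[test_idx].value_counts().sort_index().to_dict())

# Reuse the precomputed folds while predicting the actual target column.
scores = cross_val_score(SVC(), df[feature_cols], df["condition"], cv=splits)
print(scores.mean())
```

The same `splits` list can also be passed as `cv` to GridSearchCV. The features here are random noise, so the printed accuracy is only a placeholder.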
Answered By - DataJanitor