Friday, December 22, 2023

[FIXED] How to split k-fold test and train sets (for cross-validation) when stratifying by 2 columns (target and one more) in each fold?

Tags: python, scikit-learn

Issue

I have a dataframe with multiple (numerical) features. It includes different participants ('pp' column) of an experiment, doing tasks in 2 different experimental conditions.

I aim to predict/classify the working condition ('condition' column, populated with 'N' and 'S' for the 2 conditions) using the ML model with the best classification accuracy.

To do that, I need to split the data into train and test sets, stratified on 2 columns: 'pp' and 'condition', so that the training set includes part of the data from EACH participant. I then want to run cross-validation on the different (stratified) folds.

The way I am currently doing it (credit to "sklearn train_test_split on pandas stratify by multiple columns") is the following:

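from sklearn.model_selection import train_test_split

# Combine participant and condition into a single stratification key
df['stratCol'] = df['pp'].astype(str) + "_" + df['condition'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(df[feature_cols], df['condition'], test_size=0.20, stratify=df['stratCol'], shuffle=True, random_state=42)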
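
A quick way to confirm that such a split really is stratified on both columns is to count the rows per (participant, condition) cell on each side. The sketch below is only illustrative: the toy dataframe, the feature names `feat_a` and `feat_b`, and the sizes are assumptions, not the data from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 participants x 2 conditions x 10 rows each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pp": np.repeat([1, 2, 3, 4], 20),
    "condition": np.tile(np.repeat(["N", "S"], 10), 4),
    "feat_a": rng.normal(size=80),
    "feat_b": rng.normal(size=80),
})
feature_cols = ["feat_a", "feat_b"]  # assumed feature names

# Same combined key as in the question
df["stratCol"] = df["pp"].astype(str) + "_" + df["condition"].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols], df["condition"],
    test_size=0.20, stratify=df["stratCol"], shuffle=True, random_state=42,
)

# Count rows per (participant, condition) cell on each side of the split
train_counts = df.loc[X_train.index, "stratCol"].value_counts().sort_index()
test_counts = df.loc[X_test.index, "stratCol"].value_counts().sort_index()
print(train_counts)
print(test_counts)
```

If every cell shows up in both tables in roughly the 80/20 ratio, the split is stratified as intended.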

However, the problem arises when I need to do cross-validation: I cannot find a way to split into k folds stratified by a non-target column.

I have looked into sklearn's StratifiedShuffleSplit and StratifiedKFold, but they both seem to stratify on the label column by default, without letting me define the stratification column myself. I also haven't managed to find anything similar on Stack Overflow.

So how can I do cross-validation, splitting train and test in a way that is stratified by participant AND condition? In other words, how do I produce the k folds in such a stratified way?

Solution

This class is a custom workaround: it behaves like StratifiedKFold, but stratifies on one or more categorical columns in addition to y. (Let me know if you need it with numeric columns in addition to y.)

import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils import check_random_state

class StratifiedMultiColumnKFold(BaseCrossValidator):
    def __init__(self, df, additional_col_names, y, n_splits=3, shuffle=False, random_state=None):
        self.df = df.copy()
        self.additional_col_names = additional_col_names
        self.y = y
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def _iter_test_masks(self, X=None, y=None, groups=None):
        n_samples = self.df.shape[0]
        test_folds = np.zeros(n_samples, dtype=int)

        # Concatenate the columns to form a single column for stratification
        strata = self.df[self.additional_col_names].copy()
        strata['y'] = np.asarray(self.y)  # assign by position, ignoring any pandas index on y
        X_strata = strata.apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

        # Get the unique strata
        unique_strata = np.unique(X_strata)

        # Create the random state once so each stratum gets its own permutation
        rng = check_random_state(self.random_state)

        for stratum in unique_strata:
            # Get the positional indices of the rows in this stratum
            indices = np.where(X_strata == stratum)[0]

            if self.shuffle:
                rng.shuffle(indices)

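            # Distribute indices among the folds. The sizes are computed up front
            # because `indices` shrinks inside the loop; recomputing them from the
            # remaining length would leave rows unassigned (and stuck in fold 0).
            base, remainder = divmod(len(indices), self.n_splits)
            for fold_idx in range(self.n_splits):
                size = base + (fold_idx < remainder)
                test_folds[indices[:size]] = fold_idx
                indices = indices[size:]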

        for i in range(self.n_splits):
            yield test_folds == i

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(self.df.shape[0])
        for test_mask in self._iter_test_masks(X, y, groups):
            yield indices[~test_mask], indices[test_mask]
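
The balanced distribution that the inner loop aims for gives each fold `n // k` rows of a stratum, plus one extra row for the first `n % k` folds, with both numbers taken from the full stratum size. Here is a minimal standalone sketch of that arithmetic; the helper names `fold_sizes` and `assign_folds` are hypothetical and not part of the class above.

```python
import numpy as np

def fold_sizes(n_items, n_splits):
    """Near-equal fold sizes: the first `n_items % n_splits` folds get one extra item."""
    base, remainder = divmod(n_items, n_splits)
    return [base + (fold_idx < remainder) for fold_idx in range(n_splits)]

def assign_folds(indices, n_splits):
    """Return one fold id per entry of `indices`, following fold_sizes()."""
    return np.repeat(np.arange(n_splits), fold_sizes(len(indices), n_splits))

print(fold_sizes(7, 3))                 # [3, 2, 2]
print(fold_sizes(2, 3))                 # [1, 1, 0]
print(assign_folds(np.arange(7), 3))    # [0 0 0 1 1 2 2]
```

Note that a stratum with fewer rows than `n_splits` leaves some folds without any row from it, so very small (participant, condition) cells cannot be represented in every fold.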

I tested this class with the following code:

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import pandas as pd

# Load iris dataset as an example
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Standardize the features
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

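# Select the columns to stratify by
# (this feature is continuous, so every distinct value becomes its own category;
# it is used here only to exercise the class)
additional_col_names = [X.columns.tolist()[0]]  # use only the first feature column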

# Initialize the cross-validator
cv = StratifiedMultiColumnKFold(X, additional_col_names, y, n_splits=3, shuffle=True, random_state=0)

# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01], 'kernel': ['linear', 'rbf']}

# Initialize the GridSearchCV object
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=cv)

# Fit the GridSearchCV object
grid.fit(X, y)

# Print the best parameters
print(grid.best_params_)
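
A possibly simpler route, which is not part of the answer above, relies on the fact that `StratifiedKFold.split(X, y)` stratifies on whatever array is passed as its `y` argument. Passing the combined participant/condition key there, and then handing the resulting index pairs to `cv`, keeps the real target for model fitting. The sketch below assumes a toy dataframe and a `LogisticRegression` model purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy data: 4 participants x 2 conditions x 10 rows each
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pp": np.repeat([1, 2, 3, 4], 20),
    "condition": np.tile(np.repeat(["N", "S"], 10), 4),
    "feat_a": rng.normal(size=80),
    "feat_b": rng.normal(size=80),
})
feature_cols = ["feat_a", "feat_b"]  # assumed feature names
strat_key = df["pp"].astype(str) + "_" + df["condition"].astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# split() stratifies on its second argument, so pass the combined key, not the target
splits = list(skf.split(df[feature_cols], strat_key))

# Inspect how many rows of each (pp, condition) cell land in every test fold
for train_idx, test_idx in splits:
    print(strat_key.iloc[test_idx].value_counts().sort_index().to_dict())

# The precomputed (train, test) index pairs are a valid `cv` argument;
# the model is still trained and scored on the real target
scores = cross_val_score(
    LogisticRegression(), df[feature_cols], df["condition"], cv=splits
)
print(scores)
```

The same `splits` list can be passed as `cv` to `GridSearchCV`. The custom class remains useful when the fold assignment itself needs to be customized.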

Answered By - DataJanitor

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
