Issue
I do not fully understand the logic behind sklearn's train_test_split and StratifiedKFold when it comes to obtaining splits that are balanced according to multiple "columns", and not only according to the target distribution. I know the previous sentence is a bit obscure, so I hope the following code helps.
import numpy as np
import pandas as pd
import random
n_samples = 100
prob = 0.2
pos = int(n_samples * prob)
neg = n_samples - pos
target = [1] * pos + [0] * neg
cat = ["a"] * 50 + ["b"] * 50
random.shuffle(target)
random.shuffle(cat)
ds = pd.DataFrame()
ds["target"] = target
ds["cat"] = cat
ds["f1"] = np.random.random(size=(n_samples,))
ds["f2"] = np.random.random(size=(n_samples,))
print(ds.head())
This is a 100-example dataset whose target distribution is governed by prob; in this case we have 20% positive examples. There is also a binary categorical column cat, perfectly balanced. The output of the previous code is:
target cat f1 f2
0 0 a 0.970585 0.134268
1 0 a 0.410689 0.225524
2 0 a 0.638111 0.273830
3 0 b 0.594726 0.579668
4 0 a 0.737440 0.667996
With train_test_split(), stratifying on both target and cat, we can study the frequencies:
from sklearn.model_selection import train_test_split, StratifiedKFold

print("* dataset")
print(ds.target.value_counts() / len(ds))
print(ds[["target", "cat"]].value_counts() / len(ds))

# with train_test_split, stratifying on both columns at once
training, valid = train_test_split(range(n_samples),
                                   test_size=20,
                                   stratify=ds[["target", "cat"]])
print("---")
print("* training")
print(ds.loc[training, ["target", "cat"]].value_counts() / len(training))  # balanced
print("* validation")
print(ds.loc[valid, ["target", "cat"]].value_counts() / len(valid))  # balanced
we get this:
* dataset
0 0.8
1 0.2
Name: target, dtype: float64
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
---
* training
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
* validation
target cat
0 a 0.4
b 0.4
1 a 0.1
b 0.1
dtype: float64
It is perfectly stratified.
Now with StratifiedKFold:
# with stratified k-fold
skf = StratifiedKFold(n_splits=5)
try:
    for train, valid in skf.split(X=range(len(ds)), y=ds[["target", "cat"]]):
        pass
except:
    print("! does not work")

for train, valid in skf.split(X=range(len(ds)), y=ds.target):
    print("happily iterating")
output:
! does not work
happily iterating
happily iterating
happily iterating
happily iterating
happily iterating
How do I obtain with StratifiedKFold what I got with train_test_split? I know there might be data distributions that do not allow such stratification in k-fold cross-validation, but I cannot understand why train_test_split accepts two or more columns while the other method does not.
Solution
This doesn't seem readily possible currently.
Multilabel stratification isn't exactly what you're looking for, but it's related. That has been asked here before, and was an issue on sklearn's GitHub (not sure why it got closed).
As a bit of a hack, you should be able to just combine your two columns into a new one with ordered pairs, and stratify on that?
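Here is a minimal sketch of that hack, reusing the ds dataframe built in the question; the strat_key column name and the string encoding of the pairs are illustrative choices, not anything sklearn requires:

from sklearn.model_selection import StratifiedKFold

# hack: encode each (target, cat) pair as a single label, e.g. "0_a", "1_b",
# and let StratifiedKFold stratify on that combined label
ds["strat_key"] = ds["target"].astype(str) + "_" + ds["cat"]

skf = StratifiedKFold(n_splits=5)
for train, valid in skf.split(X=ds[["f1", "f2"]], y=ds["strat_key"]):
    # each fold should roughly preserve the joint target/cat frequencies
    print(ds.iloc[valid][["target", "cat"]].value_counts() / len(valid))

In the question's run the joint distribution is exactly 0.4/0.4/0.1/0.1, so each 20-row fold can reproduce it; with rarer combinations the stratification can only be approximate.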
Answered By - Ben Reiniger