Issue
Could anyone explain the difference between a "normal" k-fold cross-validation with shuffling enabled, e.g.
kf = KFold(n_splits=5, shuffle=True)
and a repeated k-fold cross-validation? Shouldn't they return the same results?
I'm having a hard time understanding the difference.
Any hint is appreciated.
Solution
As its name says, RepeatedKFold is a repeated KFold: it runs KFold n_repeats times. When n_repeats=1, the former behaves exactly like the latter with shuffle=True.
They do not return the same splits here because random_state=None by default, that is, you did not specify it. Therefore, each splitter uses a different seed to (pseudo-)randomly shuffle the data.
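To see this in isolation, here is a minimal sketch: two otherwise identical KFold splitters without an explicit random_state will (almost surely) produce different shuffles of the same data.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

# No random_state: each splitter draws its own seed, so the two runs
# below will typically disagree on which indices land in each fold.
splits_a = [test for _, test in KFold(n_splits=2, shuffle=True).split(X)]
splits_b = [test for _, test in KFold(n_splits=2, shuffle=True).split(X)]

print(splits_a)
print(splits_b)  # typically different from splits_a
```

Both runs still partition the ten samples into two folds of five; only the assignment of indices to folds changes from seed to seed.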
When they have the same random_state and are repeated once, both lead to the same splits. For a deeper understanding, try the following:
import pandas as pd
from sklearn.model_selection import KFold, RepeatedKFold

data = pd.DataFrame([['red', 'strawberry'],  # color, fruit
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['red', 'strawberry'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana'],
                     ['yellow', 'banana']])
X = data[0]

# KFold
for train_index, test_index in KFold(n_splits=2, shuffle=True, random_state=1).split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# RepeatedKFold
for train_index, test_index in RepeatedKFold(n_splits=2, n_repeats=1, random_state=1).split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
You should obtain the following:
TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]
TRAIN: [1 3 5 7 8] TEST: [0 2 4 6 9]
TRAIN: [0 2 4 6 9] TEST: [1 3 5 7 8]
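To see the repetition itself at work, here is a small sketch (reusing the same ten-sample setup) with n_repeats=2: the data is reshuffled before each repetition, so you get n_splits * n_repeats splits in total.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(10)

# n_repeats=2 runs 2-fold CV twice, reshuffling between repetitions,
# so split() yields n_splits * n_repeats = 4 train/test pairs.
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=1)
splits = list(rkf.split(X))
print(len(splits))  # 4

for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

Each repetition is a complete 2-fold partition of the data, which is what makes repeated k-fold useful for reducing the variance of a cross-validated score estimate.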
Answered By - s.dallapalma