Issue
I have a basic nested CV loop, where an outer loop goes over an inner model-tuning step. My expectation is that each fold should draw a different random sample of hyperparameter values. However, in the example below, each fold ends up sampling the same values.
Imports and make dataset:
from sklearn.model_selection import RandomizedSearchCV, KFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.base import clone
from scipy.stats import uniform
import numpy as np
X, y = make_classification(n_features=10, random_state=np.random.RandomState(0))
Nested CV loop:
# Used for tuning the random forest:
rf_tuner = RandomizedSearchCV(
    RandomForestClassifier(random_state=np.random.RandomState(0)),
    param_distributions=dict(min_samples_split=uniform(0.1, 0.9)),
    n_iter=5,
    cv=KFold(n_splits=2, shuffle=False),
    random_state=np.random.RandomState(0),
    n_jobs=1,
)
# Nested CV
for trn_idx, tst_idx in KFold(3).split(X, y):
    # 'cloned' will now share the same RNG as 'rf_tuner'
    cloned = clone(rf_tuner)
    # This should be consuming the RNG of 'rf_tuner'
    cloned.fit(X[trn_idx], y[trn_idx])
    # Report hyperparameter values sampled in this fold
    display(cloned.cv_results_['params'])
    # <more code for nested CV, not shown>
Output:
Fold 1/3:
[{'min_samples_split': 0.593},
{'min_samples_split': 0.743},
{'min_samples_split': 0.642},
{'min_samples_split': 0.590},
{'min_samples_split': 0.481}]
Fold 2/3:
[{'min_samples_split': 0.593},
{'min_samples_split': 0.743},
{'min_samples_split': 0.642},
{'min_samples_split': 0.590},
{'min_samples_split': 0.481}]
Fold 3/3:
[{'min_samples_split': 0.593},
{'min_samples_split': 0.743},
{'min_samples_split': 0.642},
{'min_samples_split': 0.590},
{'min_samples_split': 0.481}]
I start by instantiating a RandomizedSearchCV with a RandomForestClassifier. I set the random_state= of the search to a random state instance, np.random.RandomState(0).
For each pass of the outer loop, I clone() and fit() the search object. cloned should thus be using the same RNG as the original, mutating it on each pass, so each loop ought to yield a different sampling of hyperparameter values. However, as shown above, the hyperparameters sampled on each pass are identical. This suggests that each loop starts from the same unmodified RNG rather than a mutated one.
The docs say that clones of estimators share the same random state instance:
b = clone(a)
[...] calling a.fit will consume b's RNG, and calling b.fit will consume a's RNG, since they are the same
What explains the absence of randomisation between folds?
Solution
clone performs a deepcopy on each non-estimator parameter (source), so in the case of a RandomState each clone gets its own RandomState object, all starting from the same internal state (in the sense of get_state()). So your example is expected.
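A minimal sketch of this, using only the imports already in the question, clones an estimator that holds a RandomState and compares the two RNGs directly:

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.RandomState(0)
est = RandomForestClassifier(random_state=rng)
cloned = clone(est)

# clone() deepcopies the RandomState, so the two estimators hold
# distinct RNG objects...
print(est.random_state is cloned.random_state)  # False

# ...whose internal states are identical at clone time, so they
# produce the same stream of draws.
print(est.random_state.rand() == cloned.random_state.rand())  # True

Since every clone in the nested loop starts from that same copied state, each fold's RandomizedSearchCV samples the identical parameter sequence.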
I don't know offhand whether this used to behave differently, or whether the documentation has always been wrong on this point.
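If the goal is a genuinely different hyperparameter draw per fold, one workaround sketch, assuming a plain integer seed derived from the fold index is acceptable rather than a shared RandomState, is to reseed each clone before fitting:

# Reseed each fold's clone so the candidate sampling differs between folds
for fold, (trn_idx, tst_idx) in enumerate(KFold(3).split(X, y)):
    cloned = clone(rf_tuner)
    cloned.set_params(random_state=fold)  # plain int seed per fold
    cloned.fit(X[trn_idx], y[trn_idx])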
Answered By - Ben Reiniger