Issue
How is it possible that running the same Python program twice with the exact same seeds and static data input produces different results? Calling the function below repeatedly in a Jupyter notebook yields the same results; however, when I restart the kernel, the results differ. The same applies when I run the code from the command line as a Python script. Is there anything else people do to make sure their code is reproducible? All resources I found only talk about setting seeds. The randomness is introduced by ShapRFECV.
This code runs on a CPU only.
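As a baseline, seeding `random` and NumPy the way the MWE does really is enough to make those generators reproducible across fresh interpreter sessions, which narrows the problem down to the library. A minimal sanity check (independent of probatus) might look like:

```python
import random

import numpy as np


def seeded_draws(seed):
    # Re-seed both generators, then draw a few values; identical seeds
    # must yield identical draws, even in a freshly started process.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()


first = seeded_draws(1234)
second = seeded_draws(1234)
assert first == second  # the seeds fully determine these generators
```

If a check like this passes across kernel restarts while the full pipeline does not, the nondeterminism is being introduced somewhere other than these seeded generators.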
MWE (in this code I generate a dataset and eliminate features using ShapRFECV, in case that's important):
import os, random

import numpy as np
import pandas as pd
from probatus.feature_elimination import ShapRFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

global_seed = 1234
os.environ['PYTHONHASHSEED'] = str(global_seed)
np.random.seed(global_seed)
random.seed(global_seed)

feature_names = ['f1', 'f2', 'f3_static', 'f4', 'f5', 'f6', 'f7',
                 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',
                 'f16', 'f17', 'f18', 'f19', 'f20']

# Code from the tutorial in the probatus documentation
X, y = make_classification(n_samples=100, class_sep=0.05, n_informative=6,
                           n_features=20, random_state=0, n_redundant=10,
                           n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)

def shap_feature_selection(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(random_state=seed, n_estimators=70,
                                           max_features='log2',
                                           criterion='entropy',
                                           class_weight='balanced')
    # Set to run on one thread only
    shap_elimination = ShapRFECV(clf=random_forest, step=0.2, cv=5,
                                 scoring='f1_macro', n_jobs=1,
                                 random_state=seed)
    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    # Return the feature set with the best mean validation score (f1_macro)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]
Results:
# Results from the first run
shap_feature_selection(X, y, 0)
>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']
# Running again in same session
shap_feature_selection(X, y, 0)
>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']
# Restarting the kernel and running the exact same command
shap_feature_selection(X, y, 0)
>>> ['f8', 'f1', 'f17', 'f6', 'f18', 'f20', 'f12', 'f15', 'f7', 'f13', 'f11']
Details:
- Ubuntu 22.04
- Python 3.9.12
- NumPy 1.22.0
- scikit-learn 1.1.1
Solution
This has now been fixed in probatus (the issue was a bug, apparently connected to the pandas implementation they were using, see here). For me, everything works as expected when using the latest probatus code from the repository (not the released package).
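To confirm a fix like this end to end, one option is to run the selection in two freshly started interpreter processes and compare the output, since that reproduces the "restart the kernel" condition exactly. A generic sketch of that pattern, using a seeded NumPy draw as a stand-in for the real `shap_feature_selection(...)` call, might look like:

```python
import subprocess
import sys

# Stand-in script: in a real check, replace the NumPy draw with the
# actual shap_feature_selection(...) call from the question.
SCRIPT = """
import numpy as np
np.random.seed(0)
print(np.random.rand(3).tolist())
"""


def run_once():
    # Each call starts a fresh interpreter, mimicking a kernel restart.
    result = subprocess.run([sys.executable, "-c", SCRIPT],
                            capture_output=True, text=True, check=True)
    return result.stdout


assert run_once() == run_once()  # identical output across fresh processes
```

If the two runs agree, the pipeline is deterministic across process restarts; if they disagree, the seeds alone are not controlling all of the randomness.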
Answered By - Dreana