Issue
How is it possible that running the same Python program twice with the exact same seeds and static data input produces different results? Calling the function below repeatedly in a Jupyter notebook yields the same results; however, when I restart the kernel, the results differ. The same applies when I run the code from the command line as a Python script. Is there anything else people do to make sure their code is reproducible? All resources I found only talk about setting seeds. The randomness is introduced by ShapRFECV.
This code runs on a CPU only.
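As a baseline, seeding `random` and NumPy the way the MWE does really is enough to make those generators reproducible across fresh interpreter sessions, which narrows the problem down to the library. A minimal sanity check (independent of probatus) might look like:

```python
import random

import numpy as np


def seeded_draws(seed):
    # Re-seed both generators, then draw a few values; identical seeds
    # must yield identical draws, even in a freshly started process.
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()


first = seeded_draws(1234)
second = seeded_draws(1234)
assert first == second  # the seeds fully determine these generators
```

If a check like this passes across kernel restarts while the full pipeline does not, the nondeterminism is being introduced somewhere other than these seeded generators.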
MWE (in this code I generate a dataset and eliminate features using ShapRFECV, in case that's important):
import os, random

import numpy as np
import pandas as pd
from probatus.feature_elimination import ShapRFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

global_seed = 1234
os.environ['PYTHONHASHSEED'] = str(global_seed)
np.random.seed(global_seed)
random.seed(global_seed)

feature_names = ['f1', 'f2', 'f3_static', 'f4', 'f5', 'f6', 'f7',
                 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',
                 'f16', 'f17', 'f18', 'f19', 'f20']

# Code from the tutorial in the probatus documentation
X, y = make_classification(n_samples=100, class_sep=0.05, n_informative=6,
                           n_features=20, random_state=0, n_redundant=10,
                           n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)

def shap_feature_selection(X, y, seed: int) -> list[str]:
    random_forest = RandomForestClassifier(random_state=seed, n_estimators=70,
                                           max_features='log2',
                                           criterion='entropy',
                                           class_weight='balanced')
    # Set to run on one thread only
    shap_elimination = ShapRFECV(clf=random_forest, step=0.2, cv=5,
                                 scoring='f1_macro', n_jobs=1,
                                 random_state=seed)
    report = shap_elimination.fit_compute(X, y, check_additivity=True, seed=seed)
    # Return the feature set with the best mean validation score (f1_macro)
    return report.iloc[[report['val_metric_mean'].idxmax() - 1]]['features_set'].to_list()[0]
Results:
# Results from the first run
shap_feature_selection(X, y, 0)
>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']
# Running again in same session
shap_feature_selection(X, y, 0)
>>> ['f17', 'f15', 'f18', 'f8', 'f12', 'f1', 'f13']
# Restarting the kernel and running the exact same command
shap_feature_selection(X, y, 0)
>>> ['f8', 'f1', 'f17', 'f6', 'f18', 'f20', 'f12', 'f15', 'f7', 'f13', 'f11']
Details:
- Ubuntu 22.04
- Python 3.9.12
- NumPy 1.22.0
- scikit-learn 1.1.1
Solution
This has now been fixed in probatus (the issue was a bug, apparently connected to the pandas implementation they were using, see here). For me, everything works as expected when using the latest probatus code from the repository (not the released package).
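To confirm a fix like this end to end, one option is to run the selection in two freshly started interpreter processes and compare the output, since that reproduces the "restart the kernel" condition exactly. A generic sketch of that pattern, using a seeded NumPy draw as a stand-in for the real `shap_feature_selection(...)` call, might look like:

```python
import subprocess
import sys

# Stand-in script: in a real check, replace the NumPy draw with the
# actual shap_feature_selection(...) call from the question.
SCRIPT = """
import numpy as np
np.random.seed(0)
print(np.random.rand(3).tolist())
"""


def run_once():
    # Each call starts a fresh interpreter, mimicking a kernel restart.
    result = subprocess.run([sys.executable, "-c", SCRIPT],
                            capture_output=True, text=True, check=True)
    return result.stdout


assert run_once() == run_once()  # identical output across fresh processes
```

If the two runs agree, the pipeline is deterministic across process restarts; if they disagree, the seeds alone are not controlling all of the randomness.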
Answered By - Dreana