Issue
I'm using HalvingGridSearchCV in scikit-learn for hyperparameter tuning of an XGBoost model inside an imbalanced-learn pipeline. I've noticed that the best_score_ (and consequently best_params_) does not align with the highest mean_test_score in cv_results_. This discrepancy is puzzling, especially since it can be substantial in some cases; I initially expected best_score_ to match the highest mean_test_score from cv_results_, as that score typically represents the best-performing model.
Is it reasonable to consider the model with the highest mean_test_score in cv_results_ as a valid choice for deployment instead of best_estimator_? What are the trade-offs to consider?
As I can't provide the full pipeline or data, I'm sharing the relevant parts of the code:
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.model_selection import TimeSeriesSplit
# HalvingGridSearchCV is still experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

# data
X, y = ...

model = Pipeline([
    ('sampling', FunctionSampler(
        ...
    )),
    ('classification', BalancedBaggingClassifier(
        base_estimator=XGBClassifier(
            eval_metric='aucpr', use_label_encoder=False)
    ))
])

params = {
    'classification__base_estimator__max_depth': [3, 5, 7, 10],
    'classification__base_estimator__gamma': [0., 1e-4, 1e-2, 0.1, 1.]
}

cv = TimeSeriesSplit(n_splits=5)

clf = HalvingGridSearchCV(
    estimator=model,
    param_grid=params,
    scoring='average_precision',
    factor=3,
    min_resources=2500,
    cv=cv,
    verbose=1,
    refit=True,
)
clf.fit(X, y)
Results:
- Best score: 0.3516, best params: max_depth=10, gamma=1.0
- Highest mean test score: 0.4006, corresponding params: max_depth=7, gamma=0.1
The combination of parameters corresponding to the supposedly best score sometimes doesn't even appear in the top 5 highest mean test scores.
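For reference, the comparison above can be reproduced from the fitted search with something like the following sketch (pandas is used only for convenience; 'iter' and 'n_resources' are extra keys that HalvingGridSearchCV adds to cv_results_):
import pandas as pd

results = pd.DataFrame(clf.cv_results_)
# top 5 parameter combinations by mean test score, across all halving iterations
top5 = results.sort_values('mean_test_score', ascending=False).head(5)
print(top5[['iter', 'n_resources', 'params', 'mean_test_score']])
print(clf.best_score_, clf.best_params_)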
To sum up:
- Why might the best_score_ and the highest mean_test_score in cv_results_ not match, and are there specific criteria or mechanisms in HalvingGridSearchCV or XGBoost that explain this?
- How should I interpret the discrepancy when selecting the best model for deployment? Considering the model with the highest mean_test_score from cv_results_ as an alternative to best_estimator_, what trade-offs should I consider?
Any insights would be greatly appreciated. While I can't share the actual data, I can provide additional code snippets or details for context. Thank you!
Solution
The best_score_ and associated parameters always correspond to the last iteration, i.e. the one with maximum resources. In the example in the User Guide, the same thing happens: earlier iterations actually have higher mean test scores, but those are not taken into consideration for selecting the winner.
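A quick way to confirm this behaviour on the fitted search (a sketch, not part of the original answer): best_score_ equals the maximum mean_test_score within the final halving iteration only, even if an earlier iteration scored higher.
import pandas as pd

results = pd.DataFrame(clf.cv_results_)
# best mean test score per halving iteration
print(results.groupby('iter')['mean_test_score'].max())
# the winner is chosen only among the candidates of the last iteration
last = results['iter'] == results['iter'].max()
assert clf.best_score_ == results.loc[last, 'mean_test_score'].max()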
Broadly, you do expect the later iterations to perform better because they have more resources. When the resource is the number of rows (the default), the test folds are also subsampled, which can lead to this situation: the test scores in the earlier iterations are noisier, so just by chance they are sometimes higher. So I would be hesitant to select a different set of parameters based on earlier iterations.
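If you still want to give the early-iteration candidate a fair hearing, one option (a sketch, not part of the original answer) is to re-score both parameter sets on the full data with the same CV splitter, so neither is handicapped by the subsampling used in earlier halving iterations:
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

candidates = [
    clf.best_params_,  # winner of the last halving iteration
    {'classification__base_estimator__max_depth': 7,
     'classification__base_estimator__gamma': 0.1},  # highest overall mean_test_score
]
for candidate in candidates:
    # fresh copy of the pipeline with this parameter set, scored on the full data
    est = clone(model).set_params(**candidate)
    scores = cross_val_score(est, X, y, cv=cv, scoring='average_precision')
    print(candidate, scores.mean(), scores.std())
Whichever wins this full-resource comparison is the safer choice for deployment, since it removes the noise introduced by scoring on subsampled folds.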
Answered By - Ben Reiniger