Issue
I'm using HalvingGridSearchCV in scikit-learn for hyperparameter tuning of an XGBoost model inside an imbalanced-learn pipeline. I've noticed that the best_score_ (and consequently best_params_) does not align with the highest mean_test_score in cv_results_. This discrepancy is puzzling, especially since it can be substantial in some cases; I initially expected best_score_ to match the highest mean_test_score from cv_results_, as that score typically represents the best-performing model.
Is it reasonable to consider the model with the highest mean_test_score in cv_results_ as a valid choice for deployment instead of best_estimator_? What are the trade-offs to consider?
As I can't provide the full pipeline or data, I'm sharing the relevant parts of the code:
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.model_selection import TimeSeriesSplit
# HalvingGridSearchCV is still experimental and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

# data
X, y = ...

model = Pipeline([
    ('sampling', FunctionSampler(
        ...
    )),
    ('classification', BalancedBaggingClassifier(
        base_estimator=XGBClassifier(
            eval_metric='aucpr', use_label_encoder=False)
    ))
])

params = {
    'classification__base_estimator__max_depth': [3, 5, 7, 10],
    'classification__base_estimator__gamma': [0., 1e-4, 1e-2, 0.1, 1.]
}

cv = TimeSeriesSplit(n_splits=5)

clf = HalvingGridSearchCV(
    estimator=model,
    param_grid=params,
    scoring='average_precision',
    factor=3,
    min_resources=2500,
    cv=cv,
    verbose=1,
    refit=True,
)
clf.fit(X, y)
Results:
- Best score: 0.3516, best params: max_depth=10, gamma=1.0
- Highest mean test score: 0.4006, corresponding params: max_depth=7, gamma=0.1
The combination of parameters corresponding to the supposedly best score sometimes doesn't even appear in the top 5 highest mean test scores.
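For reference, the comparison above can be reproduced from the fitted search with something like the following sketch (pandas is used only for convenience; 'iter' and 'n_resources' are extra keys that HalvingGridSearchCV adds to cv_results_):
import pandas as pd

results = pd.DataFrame(clf.cv_results_)
# top 5 parameter combinations by mean test score, across all halving iterations
top5 = results.sort_values('mean_test_score', ascending=False).head(5)
print(top5[['iter', 'n_resources', 'params', 'mean_test_score']])
print(clf.best_score_, clf.best_params_)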
To sum up:
- Why might the best_score_ and the highest mean_test_score in cv_results_ not match, and are there specific criteria or mechanisms in HalvingGridSearchCV or XGBoost that explain this?
- How should I interpret the discrepancy when selecting the best model for deployment? Considering the model with the highest mean_test_score from cv_results_ as an alternative to best_estimator_, what trade-offs should I consider?
Any insights would be greatly appreciated. While I can't share the actual data, I can provide additional code snippets or details for context. Thank you!
Solution
The best_score_ and associated parameters always correspond to the last iteration, i.e. the one with maximum resources. In the example in the User Guide, the same thing happens: earlier iterations actually have higher mean test scores, but those are not taken into consideration for selecting the winner.
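A quick way to confirm this behaviour on the fitted search (a sketch, not part of the original answer): best_score_ equals the maximum mean_test_score within the final halving iteration only, even if an earlier iteration scored higher.
import pandas as pd

results = pd.DataFrame(clf.cv_results_)
# best mean test score per halving iteration
print(results.groupby('iter')['mean_test_score'].max())
# the winner is chosen only among the candidates of the last iteration
last = results['iter'] == results['iter'].max()
assert clf.best_score_ == results.loc[last, 'mean_test_score'].max()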
Broadly, you do expect the later iterations to perform better because they have more resources. When the resource is the number of rows (the default), the test folds are also subsampled, which can lead to this situation: the test scores in the earlier iterations are noisier, so just by chance they are sometimes higher. So I would be hesitant to select a different set of parameters based on earlier iterations.
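If you still want to give the early-iteration candidate a fair hearing, one option (a sketch, not part of the original answer) is to re-score both parameter sets on the full data with the same CV splitter, so neither is handicapped by the subsampling used in earlier halving iterations:
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

candidates = [
    clf.best_params_,  # winner of the last halving iteration
    {'classification__base_estimator__max_depth': 7,
     'classification__base_estimator__gamma': 0.1},  # highest overall mean_test_score
]
for candidate in candidates:
    # fresh copy of the pipeline with this parameter set, scored on the full data
    est = clone(model).set_params(**candidate)
    scores = cross_val_score(est, X, y, cv=cv, scoring='average_precision')
    print(candidate, scores.mean(), scores.std())
Whichever wins this full-resource comparison is the safer choice for deployment, since it removes the noise introduced by scoring on subsampled folds.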
Answered By - Ben Reiniger