Issue
Main question:
In a certain modelling scenario it seems more robust to me to judge the candidates tested in a sklearn.model_selection.GridSearchCV by their median performance across the CV folds instead of by the mean. Is there a way to do this?
Some more context:
Especially for small datasets, or when using a CV scheme with few samples in the test folds (e.g. LeaveOneOut), it may happen that a few folds achieve extremely low test scores while the bulk of the folds performs quite well. Selecting by the mean of all test scores may then favour a different candidate, for instance one where all folds score moderately low but none performs outrageously badly.
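To illustrate the point, here is a minimal sketch (with made-up per-fold scores, not taken from any real run) of how a single catastrophic fold can flip the ranking under the mean but not under the median:

import numpy as np

# hypothetical per-fold test scores for two candidates (illustrative only)
candidate_a = np.array([0.90, 0.88, 0.92, 0.91, 0.05])   # one catastrophic fold
candidate_b = np.array([0.75, 0.74, 0.76, 0.73, 0.74])   # consistently mediocre

print(candidate_a.mean(), candidate_b.mean())             # 0.732 vs 0.744 -> B wins on the mean
print(np.median(candidate_a), np.median(candidate_b))     # 0.90 vs 0.74   -> A wins on the median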
My current workaround has some problems:
I can tell GridSearchCV to write the best_*-attributes with respect to a custom callable passed as the refit argument, so I am using the function below to select the model that achieved the best median score among the CV folds:
import numpy as np

def best_median_score(cv_results):
    """
    Find the best median score from a cross-validation result dictionary.
    :param cv_results: dictionary of cross-validation results
    :return: index of best median score
    """
    # collect the per-fold test scores of the refit metric ('split<N>_test_<scorer>' keys)
    inner_test_scores = np.array([
        scores for key, scores in cv_results.items()
        if key.startswith('split') and f'test_{Config.refit_scorer}' in key
    ])
    median_inner_test_scores = np.median(inner_test_scores, axis=0)
    return median_inner_test_scores.argmax()
and pass it as:
grid = GridSearchCV(
    pipe,                      # pipeline object of model steps
    params,                    # parameter grid
    scoring=scorer,            # dict of multiple scorers
    refit=best_median_score,
    cv=10,
    verbose=1,
    n_jobs=-1
)
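After fitting, the callable refit still sets best_index_, best_params_ and best_estimator_, so I can at least reconstruct the winning median score from cv_results_ by hand (a sketch, reusing the model_X/model_y and Config.refit_scorer names from my setup):

grid.fit(model_X, model_y)

# per-fold test scores of the refit metric: one row per split, one column per candidate
fold_scores = np.array([
    v for k, v in grid.cv_results_.items()
    if k.startswith('split') and f'test_{Config.refit_scorer}' in k
])
print(grid.best_params_)                                  # candidate chosen by best_median_score
print(np.median(fold_scores, axis=0)[grid.best_index_])   # its median test score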
However, GridSearchCV still calculates the mean_test_scores in grid.cv_results_, where I would prefer median_test_scores instead. Also, this way I am losing the grid.best_score_ attribute and get an error when trying to score manually:
grid.score(model_X, model_y)
KeyError                                  Traceback (most recent call last)
File ~/.local/share/virtualenvs/my_env/lib/python3.9/site-packages/sklearn/model_selection/search.py:446, in BaseSearchCV.score(self, X, y)
    444 if isinstance(self.scorer, dict):
    445     if self.multimetric_:
--> 446         scorer = self.scorer_[self.refit]
    447     else:
    448         scorer = self.scorer_

KeyError: <function best_median_score at 0x7f4b840beca0>
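A possible interim workaround (a sketch, assuming scorer is the dict of scorers passed to GridSearchCV above and Config.refit_scorer is the key of the metric I care about) is to apply the individual scorer to the refitted estimator directly instead of calling grid.score:

# best_estimator_ is still refit when refit is a callable, so the chosen scorer
# can be applied to it directly; a scorer is callable as scorer(estimator, X, y)
refit_scorer = scorer[Config.refit_scorer]
print(refit_scorer(grid.best_estimator_, model_X, model_y))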
Solution
The median test performance can be calculated outside the GridSearchCV, and the estimator can then be refitted with the hyper-parameter combination that achieves the best median score.
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# refit=False: only collect cv_results_, the refit is done manually below
clf = GridSearchCV(svc, parameters, refit=False)
clf.fit(iris.data, iris.target)

results_df = pd.DataFrame(clf.cv_results_)

# median over the per-fold test scores ('split0_test_score', 'split1_test_score', ...)
results_df['median_test_score'] = results_df.filter(regex='^split').median(axis=1)
results_df['rank_test_score'] = (
    results_df['median_test_score'].rank(ascending=False, method='min').astype(int)
)

# refit the estimator with the parameter combination that has the best median score
svc.set_params(**results_df.query('rank_test_score == 1')['params'].values[0])
svc.fit(iris.data, iris.target)
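As a quick sanity check, the winning combination and its median score can also be read off results_df directly, e.g.:

best_row = results_df.sort_values('median_test_score', ascending=False).iloc[0]
print(best_row['params'], best_row['median_test_score'])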
Answered By - Venkatachalam