Issue
Main question:
In a certain modelling scenario it seems more robust to me to judge the candidates tested in a sklearn.model_selection.GridSearchCV by their median performance across the CV folds instead of by the mean. Is there a way to do this?
Some more context:
Especially for small datasets, or when using a CV scheme with few samples in the test folds (e.g. LeaveOneOut), it may happen that a few folds achieve extremely low test scores while the bulk of the folds performs quite well. Selecting by the mean of all test scores may then favour a different candidate, for instance one where all folds score moderately low but none performs outrageously badly.
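To illustrate the point, here is a minimal sketch (with made-up per-fold scores, not taken from any real run) of how a single catastrophic fold can flip the ranking under the mean but not under the median:

import numpy as np

# hypothetical per-fold test scores for two candidates (illustrative only)
candidate_a = np.array([0.90, 0.88, 0.92, 0.91, 0.05])   # one catastrophic fold
candidate_b = np.array([0.75, 0.74, 0.76, 0.73, 0.74])   # consistently mediocre

print(candidate_a.mean(), candidate_b.mean())             # 0.732 vs 0.744 -> B wins on the mean
print(np.median(candidate_a), np.median(candidate_b))     # 0.90 vs 0.74   -> A wins on the median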
My current workaround has some problems:
I can tell GridSearchCV to write the best_*-attributes with respect to a custom callable passed as the refit argument, so I am using the function below to select the model that achieved the best median score among the CV folds:
import numpy as np

def best_median_score(cv_results):
    """
    Find the best median score from a cross-validation result dictionary.
    :param cv_results: dictionary of cross-validation results
    :return: index of best median score
    """
    # collect the per-fold test scores of the refit metric ('split<N>_test_<scorer>' keys)
    inner_test_scores = np.array([
        scores for key, scores in cv_results.items()
        if key.startswith('split') and f'test_{Config.refit_scorer}' in key
    ])
    median_inner_test_scores = np.median(inner_test_scores, axis=0)
    return median_inner_test_scores.argmax()
and pass it as:
grid = GridSearchCV(
    pipe,                      # pipeline object of model steps
    params,                    # parameter grid
    scoring=scorer,            # dict of multiple scorers
    refit=best_median_score,
    cv=10,
    verbose=1,
    n_jobs=-1
)
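After fitting, the callable refit still sets best_index_, best_params_ and best_estimator_, so I can at least reconstruct the winning median score from cv_results_ by hand (a sketch, reusing the model_X/model_y and Config.refit_scorer names from my setup):

grid.fit(model_X, model_y)

# per-fold test scores of the refit metric: one row per split, one column per candidate
fold_scores = np.array([
    v for k, v in grid.cv_results_.items()
    if k.startswith('split') and f'test_{Config.refit_scorer}' in k
])
print(grid.best_params_)                                  # candidate chosen by best_median_score
print(np.median(fold_scores, axis=0)[grid.best_index_])   # its median test score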
However, GridSearchCV still calculates the mean_test_scores in grid.cv_results_, where I would prefer median_test_scores instead. Also, this way I am losing the grid.best_score_ attribute and get an error when trying to score manually:
grid.score(model_X, model_y)
KeyError                                  Traceback (most recent call last)
File ~/.local/share/virtualenvs/my_env/lib/python3.9/site-packages/sklearn/model_selection/search.py:446, in BaseSearchCV.score(self, X, y)
    444 if isinstance(self.scorer, dict):
    445     if self.multimetric_:
--> 446         scorer = self.scorer_[self.refit]
    447     else:
    448         scorer = self.scorer_

KeyError: <function best_median_score at 0x7f4b840beca0>
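A possible interim workaround (a sketch, assuming scorer is the dict of scorers passed to GridSearchCV above and Config.refit_scorer is the key of the metric I care about) is to apply the individual scorer to the refitted estimator directly instead of calling grid.score:

# best_estimator_ is still refit when refit is a callable, so the chosen scorer
# can be applied to it directly; a scorer is callable as scorer(estimator, X, y)
refit_scorer = scorer[Config.refit_scorer]
print(refit_scorer(grid.best_estimator_, model_X, model_y))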
Solution
The median test performance can be calculated outside the GridSearchCV, and the estimator can then be refitted with the hyper-parameter combination that achieves the best median score.
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()

# refit=False: only collect cv_results_, the refit is done manually below
clf = GridSearchCV(svc, parameters, refit=False)
clf.fit(iris.data, iris.target)

results_df = pd.DataFrame(clf.cv_results_)

# median over the per-fold test scores ('split0_test_score', 'split1_test_score', ...)
results_df['median_test_score'] = results_df.filter(regex='^split').median(axis=1)
results_df['rank_test_score'] = (
    results_df['median_test_score'].rank(ascending=False, method='min').astype(int)
)

# refit the estimator with the parameter combination that has the best median score
svc.set_params(**results_df.query('rank_test_score == 1')['params'].values[0])
svc.fit(iris.data, iris.target)
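As a quick sanity check, the winning combination and its median score can also be read off results_df directly, e.g.:

best_row = results_df.sort_values('median_test_score', ascending=False).iloc[0]
print(best_row['params'], best_row['median_test_score'])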
Answered By - Venkatachalam