Issue
I have data with differing weights for each sample. In my application, it is important that these weights are accounted for in estimating the model and comparing alternative models.
I'm using sklearn to estimate models and to compare alternative hyperparameter choices. But this unit test shows that GridSearchCV does not apply sample_weight to estimate scores. Is there a way to have sklearn use sample_weight to score the models?
Unit test:
from __future__ import division

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedKFold


def grid_cv(X_in, y_in, w_in, cv, max_features_grid, use_weighting):
    out_results = dict()

    for k in max_features_grid:
        clf = RandomForestClassifier(n_estimators=256,
                                     criterion="entropy",
                                     warm_start=False,
                                     n_jobs=-1,
                                     random_state=RANDOM_STATE,
                                     max_features=k)
        for train_ndx, test_ndx in cv.split(X=X_in, y=y_in):
            X_train = X_in[train_ndx, :]
            y_train = y_in[train_ndx]
            w_train = w_in[train_ndx]
            y_test = y_in[test_ndx]

            clf.fit(X=X_train, y=y_train, sample_weight=w_train)

            y_hat = clf.predict_proba(X=X_in[test_ndx, :])
            if use_weighting:
                w_test = w_in[test_ndx]
                w_i_sum = w_test.sum()
                score = w_i_sum / w_in.sum() * log_loss(y_true=y_test, y_pred=y_hat, sample_weight=w_test)
            else:
                score = log_loss(y_true=y_test, y_pred=y_hat)

            results = out_results.get(k, [])
            results.append(score)
            out_results.update({k: results})

    for k, v in out_results.items():
        if use_weighting:
            # fold scores were pre-scaled by each fold's share of the total
            # weight, so summing them yields the weighted mean
            mean_score = sum(v)
        else:
            mean_score = np.mean(v)

        out_results.update({k: mean_score})

    best_score = min(out_results.values())
    best_param = min(out_results, key=out_results.get)
    return best_score, best_param
if __name__ == "__main__":
    RANDOM_STATE = 1337

    X, y = load_iris(return_X_y=True)
    sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
    # sample_weight = np.array([1 for _ in range(len(X))])

    inner_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)
    outer_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

    rfc = RandomForestClassifier(n_estimators=256,
                                 criterion="entropy",
                                 warm_start=False,
                                 n_jobs=-1,
                                 random_state=RANDOM_STATE)
    search_params = {"max_features": [1, 2, 3, 4]}

    fit_params = {"sample_weight": sample_weight}

    my_scorer = make_scorer(log_loss,
                            greater_is_better=False,
                            needs_proba=True,
                            needs_threshold=False)

    grid_clf = GridSearchCV(estimator=rfc,
                            scoring=my_scorer,
                            cv=inner_cv,
                            param_grid=search_params,
                            refit=True,
                            return_train_score=False,
                            iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
    grid_clf.fit(X, y, **fit_params)
    print("This is the best out-of-sample score using GridSearchCV: %.6f." % -grid_clf.best_score_)

    msg = """This is the best out-of-sample score %s weighting using grid_cv: %.6f."""

    score_with_weights, param_with_weights = grid_cv(X_in=X,
                                                     y_in=y,
                                                     w_in=sample_weight,
                                                     cv=inner_cv,
                                                     max_features_grid=search_params.get("max_features"),
                                                     use_weighting=True)
    print(msg % ("WITH", score_with_weights))

    score_without_weights, param_without_weights = grid_cv(X_in=X,
                                                           y_in=y,
                                                           w_in=sample_weight,
                                                           cv=inner_cv,
                                                           max_features_grid=search_params.get("max_features"),
                                                           use_weighting=False)
    print(msg % ("WITHOUT", score_without_weights))
Which produces output:
This is the best out-of-sample score using GridSearchCV: 0.135692.
This is the best out-of-sample score WITH weighting using grid_cv: 0.099367.
This is the best out-of-sample score WITHOUT weighting using grid_cv: 0.135692.
Explanation: Since manually computing the loss without weighting produces the same score as GridSearchCV, we know that the sample weights are not being used.
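As a quick toy check (made-up numbers, just to illustrate the gap the scorer is hiding): with non-uniform weights, weighted and unweighted log-loss disagree, so a weight-blind scorer is detectable by exactly this kind of comparison.

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([[0.9, 0.1], [0.6, 0.4], [0.35, 0.65], [0.2, 0.8]])
w = np.array([1.0, 1.0, 1.0, 10.0])

print(log_loss(y_true, y_prob))                   # unweighted
print(log_loss(y_true, y_prob, sample_weight=w))  # weighted: dominated by the last sample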
Solution
sklearn version >= 1.4
From sklearn 1.4 (expected around October 2023 at the time of writing; nightly builds with the feature were already available from September 2023), you can use the new metadata routing mechanism. You simply declare which objects should or should not receive metadata such as sample_weight, as in the following script:
import numpy as np
from sklearn import set_config
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedKFold

set_config(enable_metadata_routing=True)

if __name__ == "__main__":
    RANDOM_STATE = 1337
    X, y = load_iris(return_X_y=True)
    sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
    search_params = {"max_features": [1, 2, 3, 4]}
    cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

    rfc_weighted = RandomForestClassifier(
        n_estimators=256,
        criterion="entropy",
        warm_start=False,
        n_jobs=1,
        random_state=RANDOM_STATE,
    ).set_fit_request(sample_weight=True)
    rfc_unweighted = RandomForestClassifier(
        n_estimators=256,
        criterion="entropy",
        warm_start=False,
        n_jobs=1,
        random_state=RANDOM_STATE,
    ).set_fit_request(sample_weight=False)

    scorer_weighted = make_scorer(
        log_loss, greater_is_better=False, needs_proba=True, needs_threshold=False
    ).set_score_request(sample_weight=True)
    scorer_unweighted = make_scorer(
        log_loss, greater_is_better=False, needs_proba=True, needs_threshold=False
    ).set_score_request(sample_weight=False)

    for rfc, rfc_is_weighted in zip([rfc_weighted, rfc_unweighted], [True, False]):
        for scorer, scorer_is_weighted in zip(
            [scorer_weighted, scorer_unweighted], [True, False]
        ):
            grid_clf = GridSearchCV(
                estimator=rfc,
                scoring=scorer,
                cv=cv,
                param_grid=search_params,
                refit=True,
                return_train_score=False,
            )
            if rfc_is_weighted or scorer_is_weighted:
                grid_clf.fit(X, y, sample_weight=sample_weight)
            else:
                grid_clf.fit(X, y)
            print(
                "This is the best out-of-sample score using GridSearchCV with "
                f"(is scorer weighted: {scorer_is_weighted}), (is rfc weighted: "
                f"{rfc_is_weighted}): {-grid_clf.best_score_}"
            )
which will output:
This is the best out-of-sample score using GridSearchCV with (is scorer weighted: True), (is rfc weighted: True): 0.09180030568650309
This is the best out-of-sample score using GridSearchCV with (is scorer weighted: False), (is rfc weighted: True): 0.1225297422810374
This is the best out-of-sample score using GridSearchCV with (is scorer weighted: True), (is rfc weighted: False): 0.09064253271691491
This is the best out-of-sample score using GridSearchCV with (is scorer weighted: False), (is rfc weighted: False): 0.12187958644498716
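The same request-based routing works with the other model selection helpers. Here is a minimal sketch outside GridSearchCV (my own illustration, assuming scikit-learn >= 1.4, where cross_val_score accepts a params argument and make_scorer accepts response_method):

import numpy as np
from sklearn import set_config
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import cross_val_score

set_config(enable_metadata_routing=True)

X, y = load_iris(return_X_y=True)
sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])

# both the estimator and the scorer opt in to receiving sample_weight
clf = RandomForestClassifier(
    n_estimators=64, random_state=0
).set_fit_request(sample_weight=True)
scorer = make_scorer(
    log_loss, greater_is_better=False, response_method="predict_proba"
).set_score_request(sample_weight=True)

# params= replaces the old fit_params= and is routed according to the
# requests declared above
scores = cross_val_score(clf, X, y, scoring=scorer, cv=3,
                         params={"sample_weight": sample_weight})
print(-scores.mean())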
sklearn version < 1.4
GridSearchCV takes a scoring argument, which can be a callable. The scikit-learn model evaluation documentation covers how to change the scoring function, and also how to pass your own scoring function.
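For the sake of completeness, here is a sketch of that pattern, along the lines of the custom-scorer example in those docs (my_custom_loss_func and the toy data are illustrative, not part of this answer's solution):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

def my_custom_loss_func(y_true, y_pred):
    diff = np.abs(y_true - y_pred).max()
    return np.log1p(diff)

# greater_is_better=False means the scorer negates the loss, which is why
# best_score_ is printed negated elsewhere in this answer
score = make_scorer(my_custom_loss_func, greater_is_better=False)

X = [[1], [1]]
y = [0, 1]
clf = DummyClassifier(strategy="most_frequent", random_state=0).fit(X, y)
print(my_custom_loss_func(np.array(y), clf.predict(X)))  # ~0.693
print(score(clf, X, y))                                  # ~-0.693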
EDIT: The fit_params are passed only to the fit functions, not the score functions. If there are parameters which are supposed to be passed to the scorer, they should be passed to make_scorer. But that still doesn't solve the issue here, since it would mean the whole sample_weight array gets passed to log_loss, whereas only the part corresponding to y_test at the time of calculating the loss should be passed.

sklearn does NOT support such a thing out of the box, but you can hack your way through using a pandas.DataFrame. The good news is that sklearn understands a DataFrame and keeps it that way, which means you can exploit the index of a DataFrame, as you see in the code here:
# more code

X, y = load_iris(return_X_y=True)
index = ['r%d' % x for x in range(len(y))]
y_frame = pd.DataFrame(y, index=index)
sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
sample_weight_frame = pd.DataFrame(sample_weight, index=index)

# more code

def score_f(y_true, y_pred, sample_weight):
    return log_loss(y_true.values, y_pred,
                    sample_weight=sample_weight.loc[y_true.index.values].values.reshape(-1),
                    normalize=True)

score_params = {"sample_weight": sample_weight_frame}

my_scorer = make_scorer(score_f,
                        greater_is_better=False,
                        needs_proba=True,
                        needs_threshold=False,
                        **score_params)

grid_clf = GridSearchCV(estimator=rfc,
                        scoring=my_scorer,
                        cv=inner_cv,
                        param_grid=search_params,
                        refit=True,
                        return_train_score=False,
                        iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
grid_clf.fit(X, y_frame)

# more code
As you see, score_f uses the index of y_true to find which parts of sample_weight to use. For the sake of completeness, here's the whole code:
from __future__ import division

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import GridSearchCV, RepeatedKFold


def grid_cv(X_in, y_in, w_in, cv, max_features_grid, use_weighting):
    out_results = dict()

    for k in max_features_grid:
        clf = RandomForestClassifier(n_estimators=256,
                                     criterion="entropy",
                                     warm_start=False,
                                     n_jobs=1,
                                     random_state=RANDOM_STATE,
                                     max_features=k)
        for train_ndx, test_ndx in cv.split(X=X_in, y=y_in):
            X_train = X_in[train_ndx, :]
            y_train = y_in[train_ndx]
            w_train = w_in[train_ndx]
            y_test = y_in[test_ndx]

            clf.fit(X=X_train, y=y_train, sample_weight=w_train)

            y_hat = clf.predict_proba(X=X_in[test_ndx, :])
            if use_weighting:
                w_test = w_in[test_ndx]
                w_i_sum = w_test.sum()
                score = w_i_sum / w_in.sum() * log_loss(y_true=y_test, y_pred=y_hat, sample_weight=w_test)
            else:
                score = log_loss(y_true=y_test, y_pred=y_hat)

            results = out_results.get(k, [])
            results.append(score)
            out_results.update({k: results})

    for k, v in out_results.items():
        if use_weighting:
            # fold scores were pre-scaled by each fold's share of the total
            # weight, so summing them yields the weighted mean
            mean_score = sum(v)
        else:
            mean_score = np.mean(v)

        out_results.update({k: mean_score})

    best_score = min(out_results.values())
    best_param = min(out_results, key=out_results.get)
    return best_score, best_param


if __name__ == "__main__":
    RANDOM_STATE = 1337

    X, y = load_iris(return_X_y=True)
    index = ['r%d' % x for x in range(len(y))]
    y_frame = pd.DataFrame(y, index=index)
    sample_weight = np.array([1 + 100 * (i % 25) for i in range(len(X))])
    sample_weight_frame = pd.DataFrame(sample_weight, index=index)
    # sample_weight = np.array([1 for _ in range(len(X))])

    inner_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)
    outer_cv = RepeatedKFold(n_splits=3, n_repeats=1, random_state=RANDOM_STATE)

    rfc = RandomForestClassifier(n_estimators=256,
                                 criterion="entropy",
                                 warm_start=False,
                                 n_jobs=1,
                                 random_state=RANDOM_STATE)
    search_params = {"max_features": [1, 2, 3, 4]}

    def score_f(y_true, y_pred, sample_weight):
        # y_true arrives as a DataFrame slice, so its index identifies which
        # rows of the full sample_weight frame belong to this test fold
        return log_loss(y_true.values, y_pred,
                        sample_weight=sample_weight.loc[y_true.index.values].values.reshape(-1),
                        normalize=True)

    score_params = {"sample_weight": sample_weight_frame}

    my_scorer = make_scorer(score_f,
                            greater_is_better=False,
                            needs_proba=True,
                            needs_threshold=False,
                            **score_params)

    grid_clf = GridSearchCV(estimator=rfc,
                            scoring=my_scorer,
                            cv=inner_cv,
                            param_grid=search_params,
                            refit=True,
                            return_train_score=False,
                            iid=False)  # in this usage, the results are the same for `iid=True` and `iid=False`
    grid_clf.fit(X, y_frame)
    print("This is the best out-of-sample score using GridSearchCV: %.6f." % -grid_clf.best_score_)

    msg = """This is the best out-of-sample score %s weighting using grid_cv: %.6f."""

    score_with_weights, param_with_weights = grid_cv(X_in=X,
                                                     y_in=y,
                                                     w_in=sample_weight,
                                                     cv=inner_cv,
                                                     max_features_grid=search_params.get("max_features"),
                                                     use_weighting=True)
    print(msg % ("WITH", score_with_weights))

    score_without_weights, param_without_weights = grid_cv(X_in=X,
                                                           y_in=y,
                                                           w_in=sample_weight,
                                                           cv=inner_cv,
                                                           max_features_grid=search_params.get("max_features"),
                                                           use_weighting=False)
    print(msg % ("WITHOUT", score_without_weights))
The output of the code is then:
This is the best out-of-sample score using GridSearchCV: 0.095439.
This is the best out-of-sample score WITH weighting using grid_cv: 0.099367.
This is the best out-of-sample score WITHOUT weighting using grid_cv: 0.135692.
EDIT 2: As the comment below says:

the difference in my score and the sklearn score using this solution originates in the way that I was computing a weighted average of scores. If you omit the weighted average portion of the code, the two outputs match to machine precision.
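To spell that out with made-up numbers: grid_cv scales each fold's loss by the fold's share of the total weight and then sums, i.e. it takes a weighted mean of the fold scores, while GridSearchCV takes the plain mean.

import numpy as np

# hypothetical per-fold log-losses and per-fold weight totals
fold_scores = np.array([0.10, 0.14, 0.12])
fold_weights = np.array([500.0, 300.0, 200.0])

plain_mean = fold_scores.mean()  # what GridSearchCV reports
weighted_mean = np.sum(fold_weights / fold_weights.sum() * fold_scores)

print(plain_mean)     # 0.12
print(weighted_mean)  # 0.116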
Answered By - adrin