Issue
I am using sklearn version "1.4.dev0" to weight samples in both the fitting and the scoring process, as described in the post "sklearn GridSearchCV not using sample_weight in score function" and in the metadata routing documentation: https://scikit-learn.org/dev/metadata_routing.html
I am trying to use this in a nested cross-validation scheme, where hyperparameters are tuned in an inner loop using "GridSearchCV" and performance is evaluated in an outer loop using "cross_validate". In both loops, the samples should be weighted for fitting and scoring.
What confuses me is that whether or not I pass sample_weights in the inner loop (i.e., to GridSearchCV) seems to have no effect on the results of cross_validate, although the fit times suggest that the two cross_validate calls do differ. Maybe I have misunderstood something, but this seems unexpected and wrong to me. Here is a reproducible example. I would like to know
- whether my assumption is right that the weighted cross_validate scores of the weighted and the unweighted grid search estimator should differ
- how I could implement this so that I get the expected difference in the cross_validate scores
# sklearn version is 1.4.dev0
import sklearn
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.random.seed(42)
sklearn.set_config(enable_metadata_routing=True)
X, y = make_regression(n_samples=100, n_features=5, noise=0.5)
sample_weights = np.random.rand(len(y))
estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}
scoring_inner_cv = 'neg_mean_squared_error'
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search_weighted = GridSearchCV(estimator=estimator, param_grid=hyperparameter_grid,
                                    cv=inner_cv, scoring=scoring_inner_cv)
grid_search_unweighted = GridSearchCV(estimator=estimator, param_grid=hyperparameter_grid,
                                      cv=inner_cv, scoring=scoring_inner_cv)
grid_search_weighted.fit(X, y, sample_weight=sample_weights)
grid_search_unweighted.fit(X, y)
est_weighted = grid_search_weighted.best_estimator_
est_unweighted = grid_search_unweighted.best_estimator_
weighted_score = grid_search_weighted.best_score_
unweighted_score = grid_search_unweighted.best_score_
predictions_weighted = grid_search_weighted.best_estimator_.predict(X)[:5] # these are different depending on the use of sample weights
predictions_unweighted = grid_search_unweighted.best_estimator_.predict(X)[:5]
print('predictions weighted:', predictions_weighted)
print('predictions unweighted:', predictions_unweighted)
print('best grid search score weighted:', weighted_score)
print('best grid search score unweighted:', unweighted_score)
# Setting up outer cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=43)
scorers = {'mse': 'neg_mean_squared_error'}
results_weighted = cross_validate(est_weighted.set_score_request(sample_weight=True),
                                  X,
                                  y,
                                  cv=outer_cv,
                                  scoring=scorers,
                                  return_estimator=True,
                                  params={"sample_weight": sample_weights})
results_unweighted = cross_validate(est_unweighted.set_score_request(sample_weight=True),
                                    X,
                                    y,
                                    cv=outer_cv,
                                    scoring=scorers,
                                    return_estimator=True,
                                    params={"sample_weight": sample_weights})
print('cv fit time weighted:', results_weighted['fit_time'])
print('cv fit_time unweighted', results_unweighted['fit_time'])
print('cv score weighted:', results_weighted['test_mse'])
print('cv score unweighted:', results_unweighted['test_mse'])
Out:
predictions weighted: [ -56.75523055 -46.40853794 -257.61879983 115.33482089 -123.2799114 ]
predictions unweighted: [ -56.80695125 -46.46115926 -257.55129719 115.29365222 -123.17923488]
best grid search score weighted: -0.28206979708971763
best grid search score unweighted: -0.2959277881104643
cv fit time weighted: [0.00086832 0.00075293 0.00104165 0.00075936 0.000736 ]
cv fit_time unweighted [0.00077033 0.00074911 0.00076008 0.00075603 0.00073433]
cv score weighted: [-0.29977789 -0.19323401 -0.3599154 -0.29672299 -0.42656506]
cv score unweighted: [-0.29977789 -0.19323401 -0.3599154 -0.29672299 -0.42656506]
Edit: Sorry, still a bit sleepy, I corrected the code
Solution
cross_validate trains and scores the estimator, which means that if the hyperparameters of the estimators are the same, then the trained versions inside cross_validate would also be the same. That is the case here, since both est_weighted and est_unweighted use alpha=0.1.
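As a quick sanity check (just a sketch, assuming the question's script above has already been run), you can confirm that the two best estimators carry identical hyperparameters, which is why the outer loop refits them to identical models:
# both grid searches settled on alpha=0.1, so cross_validate clones and refits
# two identical Lasso models on each outer fold
print(est_weighted.get_params())    # {'alpha': 0.1, ...}
print(est_unweighted.get_params())  # {'alpha': 0.1, ...}
print(est_weighted.get_params() == est_unweighted.get_params())  # True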
There are a few issues here though. First, you're not requesting sample_weight in your scorer, which you should do if you want the scoring to be weighted. Second, for a nested cross-validation, you should pass the GridSearchCV object itself to cross_validate. Here's the updated script:
import sklearn
from sklearn.metrics import get_scorer
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_validate, KFold
import numpy as np
np.random.seed(42)
sklearn.set_config(enable_metadata_routing=True)
X, y = make_regression(n_samples=100, n_features=5, noise=0.5)
sample_weights = np.random.rand(len(y))
estimator = Lasso().set_fit_request(sample_weight=True)
hyperparameter_grid = {"alpha": [0.1, 0.5, 1.0, 2.0]}
# scorer that routes sample_weight to the metric during scoring
scoring_inner_cv = get_scorer("neg_mean_squared_error").set_score_request(
    sample_weight=True
)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search_weighted = GridSearchCV(
    estimator=estimator,
    param_grid=hyperparameter_grid,
    cv=inner_cv,
    scoring=scoring_inner_cv,
)
grid_search_unweighted = GridSearchCV(
    estimator=estimator,
    param_grid=hyperparameter_grid,
    cv=inner_cv,
    scoring=scoring_inner_cv,
)
grid_search_weighted.fit(X, y, sample_weight=sample_weights)
grid_search_unweighted.fit(X, y)
est_weighted = grid_search_weighted.best_estimator_
est_unweighted = grid_search_unweighted.best_estimator_
print("best estimator weighted:", est_weighted)
print("best estimator unweighted:", est_unweighted)
weighted_score = grid_search_weighted.best_score_
unweighted_score = grid_search_unweighted.best_score_
predictions_weighted = grid_search_weighted.best_estimator_.predict(X)[:5]  # these are different depending on the use of sample weights
predictions_unweighted = grid_search_unweighted.best_estimator_.predict(X)[:5]
print("predictions weighted:", predictions_weighted)
print("predictions unweighted:", predictions_unweighted)
print("best grid search score weighted:", weighted_score)
print("best grid search score unweighted:", unweighted_score)
# Setting up outer cross-validation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=43)
scorers = {
    "mse": get_scorer("neg_mean_squared_error").set_score_request(sample_weight=True)
}
# nested CV: pass the GridSearchCV object itself so the inner search runs on each
# outer training fold; sample_weight is routed to both fit and the outer scorer
results_weighted = cross_validate(
    grid_search_weighted,
    X,
    y,
    cv=outer_cv,
    scoring=scorers,
    return_estimator=True,
    params={"sample_weight": sample_weights},
)
# no sample_weight passed here: fitting and scoring in the outer loop are unweighted
results_unweighted = cross_validate(
    grid_search_unweighted,
    X,
    y,
    cv=outer_cv,
    scoring=scorers,
    return_estimator=True,
)
print("cv fit time weighted:", results_weighted["fit_time"])
print("cv fit_time unweighted", results_unweighted["fit_time"])
print("cv score weighted:", results_weighted["test_mse"])
print("cv score unweighted:", results_unweighted["test_mse"])
And the output:
best estimator weighted: Lasso(alpha=0.1)
best estimator unweighted: Lasso(alpha=0.1)
predictions weighted: [ -56.75523055 -46.40853794 -257.61879983 115.33482089 -123.2799114 ]
predictions unweighted: [ -56.80695125 -46.46115926 -257.55129719 115.29365222 -123.17923488]
best grid search score weighted: -0.30694013415226884
best grid search score unweighted: -0.2959277881104613
cv fit time weighted: [0.02880669 0.02938795 0.02891922 0.02823281 0.02768564]
cv fit_time unweighted [0.02526283 0.0255146 0.0250349 0.02500224 0.02558732]
cv score weighted: [-0.34250528 -0.21293099 -0.41301416 -0.36952952 -0.43474412]
cv score unweighted: [-0.28752003 -0.20898288 -0.40011525 -0.28467415 -0.41647231]
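As a follow-up check (a sketch, assuming the updated script above has been run), return_estimator=True keeps the fitted GridSearchCV for each outer fold, so you can inspect which alpha the weighted inner search selected per fold:
# each entry of results_weighted["estimator"] is the GridSearchCV fitted on one
# outer training fold; best_params_ shows the alpha chosen by the inner CV
for i, gs in enumerate(results_weighted["estimator"]):
    print(f"outer fold {i}: best alpha = {gs.best_params_['alpha']}")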
Answered By - adrin