Issue
This block of code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
X = ...  # your feature matrix (loaded below under "prepare data for experimenting")
y = ...  # your target vector
penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)
triggers the following warning:
FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Ridge(alpha=1.5e-05)
But that code gives me completely different coefficients, which is expected, because normalisation and standardisation are different things.
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)
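(As an aside, fitting the steps manually like this should be equivalent to fitting the whole pipeline in one call:)
B.fit(X, y)  # fits the scaler on X, transforms it, then fits the Ridge step on the scaled data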
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)
The result still does not match if I set alpha = penalty * n_features.
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)
even though Ridge() uses somewhat different normalization than I expected; per the docs for the deprecated parameter:
the regressor X will be normalized by subtracting mean and dividing by l2-norm
So what's the proper way to use ridge regression with normalization, considering that the l2-norm seems only obtainable by fitting, modifying the data, and fitting again? Nothing comes to mind in the context of sklearn's ridge regression, especially from version 1.2 onward.
Prepare the data for experimenting:
import pandas as pd
url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)
X = data.iloc[:,:15]
y = data['target']
Solution
The difference is that the coefficients reported with normalize=True
are to be applied directly to the unscaled inputs, whereas the pipeline approach applies its coefficients to the model's inputs, which are the scaled features.
You can "normalize" (an unfortunate overloading of the word) the coefficients by multiplying/dividing by the features' standard deviation. Together with the change to penalty suggested in the future warning, I get the same outputs:
import numpy as np
# after refitting B with alpha = penalty * n_samples (as the warning suggests)
np.allclose(A.coef_, B[1].coef_ / B[0].scale_)
# True
(I've tested using sklearn.datasets.load_diabetes.)
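Here is a minimal sketch of that check on load_diabetes (assuming a scikit-learn version earlier than 1.2, where normalize is still accepted; the alpha scaling by n_samples follows the warning's instruction):
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
penalty = 1.5e-5

# old behaviour: only runs on scikit-learn < 1.2, where normalize still exists
A = Ridge(normalize=True, alpha=penalty).fit(X, y)

# pipeline replacement, with alpha scaled by n_samples as the warning instructs
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty * X.shape[0])).fit(X, y)

# map the pipeline's coefficients back to the unscaled feature space
print(np.allclose(A.coef_, B[1].coef_ / B[0].scale_))  # True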
Answered By - Ben Reiniger