Issue
This block of code:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
X = ...  # your feature matrix (loaded below under "prepare data for experimenting")
y = ...  # your target vector
penalty = 1.5e-5
A = Ridge(normalize=True, alpha=penalty).fit(X, y)
triggers the following warning:
FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(with_mean=False), Ridge())
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)
Set parameter alpha to: original_alpha * n_samples.
warnings.warn(
Ridge(alpha=1.5e-05)
But that code gives me completely different coefficients, which is expected, because normalisation and standardisation are different things.
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty))
B[1].fit(B[0].fit_transform(X), y)
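(As an aside, fitting the steps manually like this should be equivalent to fitting the whole pipeline in one call:)
B.fit(X, y)  # fits the scaler on X, transforms it, then fits the Ridge step on the scaled data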
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 125511.75051106009)
The result still does not match if I set alpha = penalty * n_features.
Output:
A.coef_[0], B[1].coef_[0]
(124.87330648168594, 114686.09835548172)
even though Ridge() uses somewhat different normalization than I expected; per the docs for the deprecated parameter:
the regressor X will be normalized by subtracting mean and dividing by l2-norm
So what's the proper way to use ridge regression with normalization, considering that the l2-norm seems only obtainable by fitting, modifying the data, and fitting again? Nothing comes to mind in the context of sklearn's ridge regression, especially from version 1.2 onward.
Prepare the data for experimenting:
import pandas as pd
url = 'https://drive.google.com/file/d/1bu64NqQkG0YR8G2CQPkxR1EQUAJ8kCZ6/view?usp=sharing'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
data = pd.read_csv(url, index_col=0)
X = data.iloc[:,:15]
y = data['target']
Solution
The difference is that the coefficients reported with normalize=True
are to be applied directly to the unscaled inputs, whereas the pipeline approach applies its coefficients to the model's inputs, which are the scaled features.
You can "normalize" (an unfortunate overloading of the word) the coefficients by multiplying/dividing by the features' standard deviation. Together with the change to penalty suggested in the future warning, I get the same outputs:
import numpy as np
# after refitting B with alpha = penalty * n_samples (as the warning suggests)
np.allclose(A.coef_, B[1].coef_ / B[0].scale_)
# True
(I've tested using sklearn.datasets.load_diabetes.)
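Here is a minimal sketch of that check on load_diabetes (assuming a scikit-learn version earlier than 1.2, where normalize is still accepted; the alpha scaling by n_samples follows the warning's instruction):
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
penalty = 1.5e-5

# old behaviour: only runs on scikit-learn < 1.2, where normalize still exists
A = Ridge(normalize=True, alpha=penalty).fit(X, y)

# pipeline replacement, with alpha scaled by n_samples as the warning instructs
B = make_pipeline(StandardScaler(with_mean=False), Ridge(alpha=penalty * X.shape[0])).fit(X, y)

# map the pipeline's coefficients back to the unscaled feature space
print(np.allclose(A.coef_, B[1].coef_ / B[0].scale_))  # True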
Answered By - Ben Reiniger