Issue
I am trying to determine which alpha is best for a Ridge regression with scoring='neg_mean_squared_error'.
I have an array of candidate alpha values ranging from 5e+09 down to 5e-03:
array([5.00000000e+09, 3.78231664e+09, 2.86118383e+09, 2.16438064e+09,
1.63727458e+09, 1.23853818e+09, 9.36908711e+08, 7.08737081e+08,
5.36133611e+08, 4.05565415e+08, 3.06795364e+08, 2.32079442e+08,
1.75559587e+08, 1.32804389e+08, 1.00461650e+08, 7.59955541e+07,
5.74878498e+07, 4.34874501e+07, 3.28966612e+07, 2.48851178e+07,
1.88246790e+07, 1.42401793e+07, 1.07721735e+07, 8.14875417e+06,
6.16423370e+06, 4.66301673e+06, 3.52740116e+06, 2.66834962e+06,
2.01850863e+06, 1.52692775e+06, 1.15506485e+06, 8.73764200e+05,
6.60970574e+05, 5.00000000e+05, 3.78231664e+05, 2.86118383e+05,
2.16438064e+05, 1.63727458e+05, 1.23853818e+05, 9.36908711e+04,
7.08737081e+04, 5.36133611e+04, 4.05565415e+04, 3.06795364e+04,
2.32079442e+04, 1.75559587e+04, 1.32804389e+04, 1.00461650e+04,
7.59955541e+03, 5.74878498e+03, 4.34874501e+03, 3.28966612e+03,
2.48851178e+03, 1.88246790e+03, 1.42401793e+03, 1.07721735e+03,
8.14875417e+02, 6.16423370e+02, 4.66301673e+02, 3.52740116e+02,
2.66834962e+02, 2.01850863e+02, 1.52692775e+02, 1.15506485e+02,
8.73764200e+01, 6.60970574e+01, 5.00000000e+01, 3.78231664e+01,
2.86118383e+01, 2.16438064e+01, 1.63727458e+01, 1.23853818e+01,
9.36908711e+00, 7.08737081e+00, 5.36133611e+00, 4.05565415e+00,
3.06795364e+00, 2.32079442e+00, 1.75559587e+00, 1.32804389e+00,
1.00461650e+00, 7.59955541e-01, 5.74878498e-01, 4.34874501e-01,
3.28966612e-01, 2.48851178e-01, 1.88246790e-01, 1.42401793e-01,
1.07721735e-01, 8.14875417e-02, 6.16423370e-02, 4.66301673e-02,
3.52740116e-02, 2.66834962e-02, 2.01850863e-02, 1.52692775e-02,
1.15506485e-02, 8.73764200e-03, 6.60970574e-03, 5.00000000e-03])
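(For reference, an array like the one above can be built with NumPy; the exact construction isn't shown in the question, so the following is a sketch of one way to produce 100 logarithmically spaced values from 5e+09 down to 5e-03:)

```python
import numpy as np

# 100 log-spaced values from 5e+09 down to 5e-03,
# matching the grid printed above.
alphas = 5 * np.logspace(9, -3, 100)
```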
Then I used RidgeCV to determine which of these values would be best:
ridgecv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error',
                  normalize=True, cv=KFold(10))
ridgecv.fit(X_train, y_train)
ridgecv.alpha_
and I got ridgecv.alpha_ = 0.006609705742330144.
However, I received a warning that normalize=True is deprecated and will be removed in version 1.2, and that I should use a Pipeline with a StandardScaler instead. So, following instructions on how to build a Pipeline, I did:
steps = [
    ('scaler', StandardScaler(with_mean=False)),
    ('model', RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=KFold(10)))
]
ridge_pipe2 = Pipeline(steps)
ridge_pipe2.fit(X_train, y_train)
y_pred = ridge_pipe2.predict(X_test)
ridge_pipe2.named_steps.model.alpha_
This way, I got ridge_pipe2.named_steps.model.alpha_ = 1.328043891473342.
For a last check, I also used GridSearchCV
as follows:
steps = [
    ('scaler', StandardScaler()),
    ('model', Ridge())
]
ridge_pipe = Pipeline(steps)
ridge_pipe.fit(X_train, y_train)
parameters = [{'model__alpha': alphas}]
grid_search = GridSearchCV(estimator=ridge_pipe,
                           param_grid=parameters,
                           scoring='neg_mean_squared_error',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(X_train, y_train)
grid_search.best_params_['model__alpha']
where I got grid_search.best_params_['model__alpha'] = 1.328043891473342
(the same as the other Pipeline approach).
So, my question: why does normalizing my dataset with normalize=True versus StandardScaler() yield different best alpha values?
Solution
The corresponding deprecation warning for ordinary Ridge makes an additional mention:
Set parameter alpha to: original_alpha * n_samples.
(I don't entirely understand why this is, but for now I'm willing to leave it. A note along these lines should probably be added to the RidgeCV warning too.) Changing the alphas parameter in your second approach to [alph * X.shape[0] for alph in alphas] should work. The selected alpha_ will be different, but rescaling it back with ridge_pipe2.named_steps.model.alpha_ / X.shape[0] retrieves the same value as in the first approach (as well as the same rescaled coefficients).
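A minimal sketch of the suggested fix, using synthetic data in place of the question's X_train/y_train (make_regression here is a stand-in; the original data isn't shown):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; the question's real X_train / y_train aren't shown.
X_train, y_train = make_regression(n_samples=200, n_features=10,
                                   noise=10.0, random_state=0)

alphas = 5 * np.logspace(9, -3, 100)
n = X_train.shape[0]

# Multiply each candidate alpha by n_samples, as the Ridge warning advises.
scaled_alphas = [a * n for a in alphas]

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RidgeCV(alphas=scaled_alphas,
                      scoring='neg_mean_squared_error', cv=KFold(10))),
])
pipe.fit(X_train, y_train)

# Divide by n_samples again to express the chosen alpha on the original grid.
best_alpha = pipe.named_steps['model'].alpha_ / n
```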
(I've used the dataset shared in the linked question, and added the experiment to the notebook I created there.)
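As for where the factor of n_samples comes from, one plausible account (not spelled out in the warning): normalize=True divided each centered column by its l2 norm, while StandardScaler divides by the standard deviation, and those two scalings differ by a factor of sqrt(n); since ridge's penalty is quadratic in the coefficients, a sqrt(n) change in feature scale corresponds to a factor-n change in the equivalent alpha. A quick numeric check of the sqrt(n) relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)
n = x.size

centered = x - x.mean()
l2_norm = np.linalg.norm(centered)  # scale used by the old normalize=True
std = x.std()                       # scale used by StandardScaler (ddof=0)

# The two scalings differ by exactly sqrt(n).
ratio = l2_norm / std
```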
Answered By - Ben Reiniger