Issue
I'm trying to build a GridSearchCV using Pipeline, and I want to test both transformers and estimators. Is there a more concise way of doing so?
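from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', KNeighborsClassifier())
])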
parameters = [{
    'imputer': (SimpleImputer(), ),
    'imputer__strategy': ('median', 'mean'),
    'pca__n_components': (10, 20),
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10)
}, {
    'imputer': (SimpleImputer(), ),
    'imputer__strategy': ('median', 'mean'),
    'pca__n_components': (10, 20),
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}, {
    'imputer': (KNNImputer(), ),
    'imputer__n_neighbors': (5, 10),
    'pca__n_components': (10, 20),
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10)
}, {
    'imputer': (KNNImputer(), ),
    'imputer__n_neighbors': (5, 10),
    'pca__n_components': (10, 20),
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters)
Instead of having 4 blocks of parameters, I want to declare the 2 imputation methods that I want to test, each with its corresponding parameters, and the 2 classifiers, without declaring pca__n_components 4 times.
Solution
When you get hyperparameters that depend on each other a fair bit, the parameter grid approach gets to be cumbersome. There are a few ways to get what you need.
Nested grid searches
GridSearchCV(
    estimator=GridSearchCV(estimator=pipeline, param_grid=imputer_grid),
    param_grid=estimator_grid,
)
For each estimator candidate, this runs a grid search over the imputer candidates; the best imputer is used for the estimator, and then the estimators-with-best-imputers are compared.
The main drawback here is that the inner search gets cloned for each estimator candidate, so you don't get access to the cv_results_ for the non-winning estimators' imputers.
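As a rough sketch of how the nested search above could be wired up, the snippet below fills in imputer_grid and estimator_grid on a small synthetic dataset. The data, the cv=3 setting, and the pca_grid helper dict are assumptions made for illustration, not part of the original answer. Note that the outer grid addresses the pipeline through the inner search, so its keys carry an extra estimator__ prefix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% missing values (illustrative only).
X, y = make_classification(n_samples=120, n_features=25, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('clf', KNeighborsClassifier()),
])

# Declared once and merged into each inner sub-grid.
pca_grid = {'pca__n_components': [10, 20]}

# Inner search: imputer candidates (plus PCA), keyed on the pipeline itself.
imputer_grid = [
    {'imputer': [SimpleImputer()],
     'imputer__strategy': ['median', 'mean'], **pca_grid},
    {'imputer': [KNNImputer()],
     'imputer__n_neighbors': [5, 10], **pca_grid},
]

# Outer search: classifier candidates. The outer estimator is the inner
# GridSearchCV, so the pipeline's parameters sit under 'estimator__'.
estimator_grid = [
    {'estimator__clf': [LogisticRegression()],
     'estimator__clf__C': [1, 10]},
    {'estimator__clf': [KNeighborsClassifier()],
     'estimator__clf__n_neighbors': [10, 25]},
]

inner = GridSearchCV(estimator=pipeline, param_grid=imputer_grid, cv=3)
outer = GridSearchCV(estimator=inner, param_grid=estimator_grid, cv=3)
outer.fit(X, y)

print(outer.best_params_)                  # winning classifier settings
print(outer.best_estimator_.best_params_)  # winning imputer/PCA settings
```

Here outer.best_estimator_ is the refitted inner search, so its cv_results_ covers only the imputer candidates tried with the winning classifier, which is the drawback described above.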
Pythonically generate (part of) the grid
ParameterGrid, used internally by GridSearchCV, is mostly a wrapper around itertools.product. So we can use itertools ourselves to create (chunks of) the grid. E.g. we can create the list you've written, but with less repeated code:
import itertools

imputers = [{
    'imputer': (SimpleImputer(), ),
    'imputer__strategy': ('median', 'mean'),
},
{
    'imputer': (KNNImputer(), ),
    'imputer__n_neighbors': (5, 10),
}]
models = [{
    'clf': (LogisticRegression(),),
    'clf__C': (1, 10),
},
{
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
pcas = [{'pca__n_components': (10, 20),}]
parameters = [
    {**imp, **pca, **model}  # in Python 3.9+ the slicker notation imp | pca | model works
    for imp, pca, model in itertools.product(imputers, pcas, models)
]  # this should give the same as your hard-coded list-of-dicts
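To check that the generated grid really matches the hand-written one, you can count the candidates with ParameterGrid. The sketch below wraps the same itertools.product idea in a small helper; combine_grids is a hypothetical name introduced here for illustration, not a scikit-learn function.

```python
import itertools

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier


def combine_grids(*groups):
    """Hypothetical helper: merge one sub-grid from each group, for every
    combination of sub-grids, into a list of param_grid dicts."""
    combined = []
    for combo in itertools.product(*groups):
        merged = {}
        for part in combo:
            merged.update(part)
        combined.append(merged)
    return combined


imputers = [
    {'imputer': (SimpleImputer(),), 'imputer__strategy': ('median', 'mean')},
    {'imputer': (KNNImputer(),), 'imputer__n_neighbors': (5, 10)},
]
pcas = [{'pca__n_components': (10, 20)}]
models = [
    {'clf': (LogisticRegression(),), 'clf__C': (1, 10)},
    {'clf': (KNeighborsClassifier(),), 'clf__n_neighbors': (10, 25)},
]

parameters = combine_grids(imputers, pcas, models)

# 2 imputer sub-grids x 1 PCA sub-grid x 2 model sub-grids = 4 dicts,
# each expanding to 2 x 2 x 2 = 8 candidates.
print(len(parameters))                 # 4
print(len(ParameterGrid(parameters)))  # 32
```

The resulting list can be passed to GridSearchCV as param_grid exactly like the hard-coded version.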
Answered By - Ben Reiniger