Sunday, July 3, 2022

[FIXED] Sklearn Pipeline with multiple transforms and estimators

Labels: gridsearchcv, pipeline, scikit-learn

Issue

I'm trying to build a GridSearchCV using Pipeline, and I want to test both transformers and estimators. Is there a more concise way of doing so?

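from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([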
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('pca',  PCA()), 
    ('clf', KNeighborsClassifier())
])

parameters = [{
        'imputer': (SimpleImputer(), ), 
        'imputer__strategy': ('median', 'mean'),
        'pca__n_components': (10, 20), 
        'clf': (LogisticRegression(),),
        'clf__C': (1,10)
    }, {
        'imputer': (SimpleImputer(), ), 
        'imputer__strategy': ('median', 'mean'),
        'pca__n_components': (10, 20), 
        'clf': (KNeighborsClassifier(),),
        'clf__n_neighbors': (10, 25),
    }, {
        'imputer': (KNNImputer(), ), 
        'imputer__n_neighbors': (5, 10),
        'pca__n_components': (10, 20), 
        'clf': (LogisticRegression(),),
        'clf__C': (1,10)
    }, {
        'imputer': (KNNImputer(), ), 
        'imputer__n_neighbors': (5, 10),
        'pca__n_components': (10, 20), 
        'clf': (KNeighborsClassifier(),),
        'clf__n_neighbors': (10, 25),
    }]
grid_search = GridSearchCV(estimator=pipeline, param_grid=parameters)

Instead of having 4 blocks of parameters, I want to declare the 2 imputation methods that I want to test with their corresponding parameters, and the 2 classifiers, without declaring pca__n_components 4 times.

Solution

When hyperparameters depend on each other this much, writing the parameter grid out by hand becomes cumbersome. There are a few ways to get what you need.

Nested grid searches

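# imputer_grid: imputer (and PCA) candidates, keyed as for the pipeline itself.
# estimator_grid: classifier candidates; these are set on the inner search,
# so their keys need the estimator__ prefix (e.g. estimator__clf__C).
GridSearchCV(
    estimator=GridSearchCV(estimator=pipeline, param_grid=imputer_grid),
    param_grid=estimator_grid,
)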

For each estimator candidate, this runs a grid search over the imputer candidates; the best imputer is used for the estimator, and then the estimators-with-best-imputers are compared.

The main drawback here is that the inner search gets cloned for each estimator candidate, and so you don't get access to the cv_results_ for the non-winning estimator's imputers.
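A minimal runnable sketch of the nested approach follows. The toy data, the small PCA and neighbor values, and the names pca_grid, inner_search and outer_search are assumptions made for illustration; they are not part of the original answer. The point to note is the estimator__ prefix on the outer grid's keys, which is needed because the outer search sets parameters on the inner GridSearchCV, whose pipeline lives under its estimator parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with roughly 5% missing values (illustrative assumption only).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", KNeighborsClassifier()),
])

# Inner grid: keys address the pipeline steps directly.
pca_grid = {"pca__n_components": [2, 4]}
imputer_grid = [
    {"imputer": [SimpleImputer()], "imputer__strategy": ["median", "mean"], **pca_grid},
    {"imputer": [KNNImputer()], "imputer__n_neighbors": [5, 10], **pca_grid},
]

# Outer grid: keys are prefixed with estimator__ because they are applied
# to the inner GridSearchCV, which wraps the pipeline as its `estimator`.
estimator_grid = [
    {"estimator__clf": [LogisticRegression()], "estimator__clf__C": [1, 10]},
    {"estimator__clf": [KNeighborsClassifier()], "estimator__clf__n_neighbors": [3, 5]},
]

inner_search = GridSearchCV(estimator=pipeline, param_grid=imputer_grid, cv=3)
outer_search = GridSearchCV(estimator=inner_search, param_grid=estimator_grid, cv=3)
outer_search.fit(X, y)

# Winning classifier settings (outer search) ...
print(outer_search.best_params_)
# ... and the imputer/PCA settings chosen for it (refit inner search).
print(outer_search.best_estimator_.best_params_)
```

After fitting, outer_search.best_estimator_ is itself a fitted GridSearchCV, so its cv_results_ covers the imputer candidates for the winning classifier only, which is the drawback described above.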

Pythonically generate (part of) the grid

ParameterGrid, used internally by GridSearchCV, is mostly a wrapper around itertools.product. So we can use itertools ourselves to create (chunks of) the grid. E.g. we can create the list you've written, but with less repeated code:

import itertools

imputers = [{
    'imputer': (SimpleImputer(), ), 
    'imputer__strategy': ('median', 'mean'),
},
{
    'imputer': (KNNImputer(), ), 
    'imputer__n_neighbors': (5, 10),
}]
models = [{
    'clf': (LogisticRegression(),),
    'clf__C': (1,10),
},
{
    'clf': (KNeighborsClassifier(),),
    'clf__n_neighbors': (10, 25),
}]
pcas = [{'pca__n_components': (10, 20),}]
parameters = [
    {**imp, **pca, **model}  # in Python 3.9+ the slicker notation imp | pca | model works
    for imp, pca, model in itertools.product(imputers, pcas, models)
]  # this should give the same as your hard-coded list-of-dicts
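As a quick sanity check, the generated list can be passed to ParameterGrid to count the candidates it expands to. The sketch below wraps the merging step in a small helper; combine_grids is a hypothetical name introduced here for illustration, not a scikit-learn function.

```python
import itertools

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier


def combine_grids(*groups):
    """Hypothetical helper: merge one dict from each group, for every combination."""
    combined = []
    for parts in itertools.product(*groups):
        merged = {}
        for part in parts:
            merged.update(part)
        combined.append(merged)
    return combined


imputers = [
    {"imputer": (SimpleImputer(),), "imputer__strategy": ("median", "mean")},
    {"imputer": (KNNImputer(),), "imputer__n_neighbors": (5, 10)},
]
pcas = [{"pca__n_components": (10, 20)}]
models = [
    {"clf": (LogisticRegression(),), "clf__C": (1, 10)},
    {"clf": (KNeighborsClassifier(),), "clf__n_neighbors": (10, 25)},
]

parameters = combine_grids(imputers, pcas, models)

# 2 imputer blocks x 1 PCA block x 2 model blocks = 4 dicts,
# each expanding to 2 x 2 x 2 = 8 candidates.
print(len(parameters))                 # number of dicts in the grid list
print(len(ParameterGrid(parameters)))  # total candidates GridSearchCV would try
```

Adding a third imputer or classifier then means appending one dict to the relevant list rather than writing out every combination by hand.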

Answered By - Ben Reiniger

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
