Issue
Suppose I have a pipeline that preprocesses my data and ends with an estimator. If I only want to change the estimator/model in the last step of the pipeline, how do I do that without preprocessing the same data all over again? Below is a code example:
pipe = make_pipeline(
    ColumnSelector(columns),
    CategoricalEncoder(categories),
    FunctionTransformer(pd.get_dummies, validate=False),
    StandardScaler(scale),
    LogisticRegression(),
)
Now I want to swap the model for Ridge or some model other than LogisticRegression. How do I do this without redoing the preprocessing?
EDIT: Can I get my transformed data from a pipeline of the following sort?
pipe = make_pipeline(
    ColumnSelector(columns),
    CategoricalEncoder(categories),
    FunctionTransformer(pd.get_dummies, validate=False),
    StandardScaler(scale)
)
Solution
If your transformers are computationally expensive, you can use caching. As you didn't provide your transformers, here is an extension of the sklearn example from the link, in which two models are grid searched with a cached pipeline:
from tempfile import mkdtemp
from shutil import rmtree
from joblib import Memory  # sklearn.externals.joblib was removed from newer scikit-learn releases
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
# Create a temporary folder to store the transformers of the pipeline
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)  # `cachedir=` was replaced by `location=` in joblib
# the pipeline
pipe = Pipeline([('reduce_dim', PCA()),
                 ('classify', LinearSVC())],
                memory=memory)
# models to try
param_grid = {"classify" : [LinearSVC(), ElasticNet()]}
# do the gridsearch on the models
grid = GridSearchCV(pipe, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)
# delete the temporary cache before exiting
rmtree(cachedir)
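If you only want to swap the final step by hand rather than grid search over models, the same caching idea can be combined with `Pipeline.set_params`. The following is a minimal sketch, not a drop-in for your pipeline: the digits data, `PCA(n_components=16)` and the two classifiers are stand-ins chosen for illustration, and the cache directory is passed to `memory` as a plain path string.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.pipeline import Pipeline

# Illustrative data; replace with your own X and y.
X, y = load_digits(return_X_y=True)

cachedir = mkdtemp()
try:
    pipe = Pipeline(
        [("reduce_dim", PCA(n_components=16)),
         ("classify", LogisticRegression(max_iter=1000))],
        memory=cachedir,  # a path string works as well as a joblib Memory object
    )
    pipe.fit(X, y)  # the PCA step is fitted here and its result is cached
    score_logreg = pipe.score(X, y)

    # Replace only the last step; the "reduce_dim" step is left unchanged.
    pipe.set_params(classify=RidgeClassifier())
    pipe.fit(X, y)  # same transformer parameters and data, so the cached fit is reused
    score_ridge = pipe.score(X, y)
finally:
    # delete the temporary cache before exiting
    rmtree(cachedir)
```

The cache is keyed on the transformer's parameters and the input data, so the second `fit` only has to train the new final estimator.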
Edit:
As your question focuses on models and the linked question focuses on parameters, I wouldn't consider it an exact duplicate. However, the solution proposed there, combined with a param_grid set up as shown here, would also be a good, maybe even better, solution, depending on your exact problem.
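Regarding the EDIT in the question: a pipeline made up only of transformers can return the transformed data through `fit_transform`, and that output can then be fed to any number of models. This is a rough sketch under assumptions: `StandardScaler` and `PCA` stand in for your own transformers, and the digits data and both classifiers are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; replace with your own X and y.
X, y = load_digits(return_X_y=True)

# Preprocessing-only pipeline: every step is a transformer.
preprocess = make_pipeline(StandardScaler(), PCA(n_components=16))
X_t = preprocess.fit_transform(X)  # run the preprocessing once

# Fit as many models as needed on the already-transformed data.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "ridge": RidgeClassifier(),
}
scores = {name: model.fit(X_t, y).score(X_t, y) for name, model in models.items()}
```

Note that fitting the preprocessing on the full data before any cross-validation lets information from the validation folds leak into the transformers. The cached pipeline avoids this because the transformers are refitted per fold, so the manual route is mainly useful when you have a fixed train/test split.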
Answered By - Marcus V.