Issue
I'd like to find the best imputation method for missing data in scikit-learn. I have a dataset X, and I have created an artificially corrupted version of it, X_na, so I can measure the quality of different imputations. I'm wondering whether I could use sklearn's GridSearchCV to search over possible imputers like this:
imputer_pipe = Pipeline([("imputer", SimpleImputer())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]
imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring="neg_mean_squared_error", cv=5)
imputer_grid.fit(X_na, X)
But the problem is that imputer_grid.fit doesn't channel X_na and X to the imputer pipeline, so I cannot instruct it to compare the imputed X_na with X through scoring (MSE). The pipeline must have some object with .fit() accepting both X and y.
Solution
Your imputers are transformers, so none of them has a predict method, which the scorer needs. You can add a custom final estimator that simply returns its input, i.e. the imputed matrix passed to it. Below is something I adapted from DummyRegressor:
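from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.utils import check_array, check_consistent_length

class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        # Nothing is learned; only validate that y is non-empty and matches X.
        y = check_array(y, ensure_2d=False)
        if len(y) == 0:
            raise ValueError("y must not be empty.")
        check_consistent_length(X, y)
        # Trailing-underscore attribute so check_is_fitted treats this as fitted.
        self.is_fitted_ = True
        return self

    def predict(self, X):
        # Return the (already imputed) input unchanged.
        return X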
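As a quick sanity check, an identity step like this can be placed after a single imputer, and predict should then hand back the imputed matrix. This is only a minimal sketch: the small random array, the fixed seed and the is_fitted_ attribute are my own assumptions for illustration, and the class is a trimmed-down variant without the input validation.

```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """Final pipeline step that returns its input unchanged."""

    def fit(self, X, y):
        # Assumption: a trailing-underscore attribute marks the step as fitted.
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return X


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
X = rng.uniform(0, 1, (20, 3))
X_na = np.where(X < 0.3, np.nan, X)  # mask small values, as in the answer

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("identity", IdentityFunction())])

# fit() receives the corrupted data as X and the clean data as y;
# predict() returns whatever the imputer produced.
X_imputed = pipe.fit(X_na, X).predict(X_na)

print(X_imputed.shape)
print(np.isnan(X_imputed).any())
```

Because predict returns the imputed matrix, any regression scorer that compares predict(X_na) with y is effectively comparing the imputation with the clean data.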
Then we define the pipeline using an example dataset:
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer
from sklearn.model_selection import GridSearchCV
import numpy as np

# The imputer step is swapped out by the grid; the identity step only exposes predict().
imputer_pipe = Pipeline([("imputer", SimpleImputer()),
                         ("identity", IdentityFunction())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]
Use a dummy dataset and fit:
X = np.random.uniform(0, 1, (100, 3))
X_na = np.where(X < 0.3, np.nan, X)
imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring="neg_mean_squared_error", cv=5)
imputer_grid.fit(X_na, X)
The result is not useful here, because the dummy matrix is independent random noise and contains no information to impute from:
Pipeline(steps=[('imputer', IterativeImputer()),
('identity', IdentityFunction())])
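To see how every candidate scored rather than only the selected pipeline, the fitted search's cv_results_ can be inspected. The following is a minimal sketch that repeats the setup above so it runs on its own; the fixed seed, the is_fitted_ attribute and the printing loop are my additions, not part of the original answer.

```python
import numpy as np
from sklearn.base import BaseEstimator, MultiOutputMixin, RegressorMixin
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class IdentityFunction(MultiOutputMixin, RegressorMixin, BaseEstimator):
    """Final pipeline step that returns its input unchanged."""

    def fit(self, X, y):
        # Assumption: a trailing-underscore attribute marks the step as fitted.
        self.is_fitted_ = True
        return self

    def predict(self, X):
        return X


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
X = rng.uniform(0, 1, (100, 3))
X_na = np.where(X < 0.3, np.nan, X)

imputer_pipe = Pipeline([("imputer", SimpleImputer()),
                         ("identity", IdentityFunction())])
params = [{"imputer": [SimpleImputer()]},
          {"imputer": [IterativeImputer()]},
          {"imputer": [KNNImputer()], "imputer__n_neighbors": [3, 5, 7]}]

imputer_grid = GridSearchCV(imputer_pipe, param_grid=params,
                            scoring="neg_mean_squared_error", cv=5)
imputer_grid.fit(X_na, X)

# One row per candidate: imputer class, optional n_neighbors, mean and std of
# the negated MSE across the 5 folds (values closer to 0 are better).
res = imputer_grid.cv_results_
for p, mean, std, rank in zip(res["params"], res["mean_test_score"],
                              res["std_test_score"], res["rank_test_score"]):
    name = type(p["imputer"]).__name__
    k = p.get("imputer__n_neighbors", "-")
    print(f"rank={rank} {name:<17} n_neighbors={k} "
          f"neg_mse={mean:.4f} +/- {std:.4f}")
```

Note that the score is averaged over all entries, including the ones that were never masked, so the differences between imputers appear smaller than they would if only the corrupted cells were compared.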
Answered By - StupidWolf