Issue
I have the following (hypothetical) dataset:
| numerical_ok | numerical_missing | categorical |
|---|---|---|
| 210 | 30 | cat1 |
| 180 | NaN | cat2 |
| 70 | 19 | cat3 |
Here categorical is a string column, while numerical_ok and numerical_missing are both numerical columns, but the latter has some missing values. I want to perform three tasks:
- One-hot encode the categorical column
- Impute the NAs in numerical_missing using SimpleImputer or another imputer available in sklearn
- Apply KBinsDiscretizer to numerical_missing
Of course this is quite easy to do if I use a mixed pandas/sklearn approach:
df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])
ColumnTransformer([
("encoder", OneHotEncoder(), ["categorical"]),
("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)
But for reasons of scalability and consistency (I'll later fit a model to this data), I'd like to see how this could work using pipelines. I tried two ways to do that:
WAY 1: Using a single ColumnTransformer.
But since it seems to execute all the steps jointly, KBinsDiscretizer still runs on the data with missing values:
ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)
Giving this error:
KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
WAY 2: Composing two ColumnTransformers into a Pipeline
Now the output of the first step is a sparse array in which I cannot access the original feature names.
from sklearn.pipeline import Pipeline

Pipeline([
    ("transformer", ColumnTransformer([
        ("imputer", SimpleImputer(), ["numerical_missing"]),
        ("encoder", OneHotEncoder(), ["categorical"])
    ], remainder="passthrough")),
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
    ]))
]).fit_transform(df)
This gives me:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
How to proceed? Thanks.
Solution
In addition to the question linked in the comments and its Linked questions (where I generally suggest defining one pipeline for each combination of transformers you want applied in sequence, and a single ColumnTransformer to run those pipelines in parallel), for this smallish example I'd also like to suggest an index-based fix for your second attempt.
The output of ColumnTransformer has its columns in the order of its transformers, with the remainder at the end. So in your case, the output will be the now-imputed numerical_missing, followed by some unknown number of one-hot encoded columns, followed by the remainder numerical_ok. Since you only want to bin the (imputed) numerical_missing, you can specify the discretizer column transformer as operating on the 0th column of its input:
Pipeline([
    ("transformer", ColumnTransformer([
        ("imputer", SimpleImputer(), ["numerical_missing"]),
        ("encoder", OneHotEncoder(), ["categorical"])
    ], remainder="passthrough")),
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), [0])
    ]))
]).fit_transform(df)
I tend to prefer using column names, so the single-column-transformer-with-separate-pipelines approach may still be preferable, but this isn't such a bad solution either.
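If you do want to keep referring to columns by name in the second step, another possibility is to have the first ColumnTransformer emit a DataFrame, so its output columns can be selected by their prefixed names. A minimal sketch, assuming scikit-learn >= 1.2 (for the set_output API and OneHotEncoder's sparse_output parameter):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Sketch assuming scikit-learn >= 1.2; the pandas output container
# requires dense output, hence sparse_output=False on the encoder.
preprocess = ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(sparse_output=False), ["categorical"]),
], remainder="passthrough").set_output(transform="pandas")

Pipeline([
    ("transformer", preprocess),
    # ColumnTransformer prefixes its output columns with the transformer name,
    # so the imputed column arrives here as "imputer__numerical_missing".
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), ["imputer__numerical_missing"])
    ]))
]).fit_transform(df)

(The prefixes come from ColumnTransformer's verbose_feature_names_out option, which defaults to True.)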
OK, I suppose I might as well also include the approach I keep alluding to:
num_mis_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer()),
])

ColumnTransformer([
    ("imp_disc", num_mis_pipe, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)
Answered By - Ben Reiniger