Issue
I have the following (hypothetical) dataset:
| numerical_ok | numerical_missing | categorical |
|---|---|---|
| 210 | 30 | cat1 |
| 180 | NaN | cat2 |
| 70 | 19 | cat3 |
Here categorical is a string column, while numerical_ok and numerical_missing are both numerical columns, but the latter has some missing values. I want to perform three tasks:
- One-hot encode the categorical column
- Impute the NAs in numerical_missing using SimpleImputer or another imputer available in sklearn
- Apply KBinsDiscretizer to numerical_missing
Of course this is quite easy to do if I use a mixed pandas/sklearn approach:
df["numerical_missing"] = SimpleImputer().fit_transform(df[["numerical_missing"]])
ColumnTransformer([
("encoder", OneHotEncoder(), ["categorical"]),
("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)
But for reasons of scalability and consistency (I'll later fit a model to this data), I'd like to see how this could work using pipelines. I tried two ways to do that:
WAY 1: Using a single ColumnTransformer.
But since it seems to execute all the steps jointly, KBinsDiscretizer still runs on the data with missing values:
ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
    ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
], remainder="passthrough").fit_transform(df)
Giving this error:
KBinsDiscretizer does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
WAY 2: Composing two ColumnTransformers into a Pipeline
Now the output of the first step is a sparse array in which I cannot access the original feature names.
from sklearn.pipeline import Pipeline

Pipeline([
    ("transformer", ColumnTransformer([
        ("imputer", SimpleImputer(), ["numerical_missing"]),
        ("encoder", OneHotEncoder(), ["categorical"])
    ], remainder="passthrough")),
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), ["numerical_missing"])
    ]))
]).fit_transform(df)
This gives me:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
How to proceed? Thanks.
Solution
In addition to the question linked in the comments and its Linked questions (where I generally suggest defining one pipeline for each combination of transformers you want applied in sequence, and a single ColumnTransformer to run those pipelines in parallel), for this smallish example I'd also like to suggest an index-based fix for your second attempt.
The output of ColumnTransformer has its columns in the order of its transformers, with the remainder at the end. So in your case, the output will be the now-imputed numerical_missing, followed by some unknown number of one-hot encoded columns, followed by the remainder numerical_ok. Since you only want to bin the (imputed) numerical_missing, you can specify the discretizer column transformer as operating on the 0th column of its input:
Pipeline([
    ("transformer", ColumnTransformer([
        ("imputer", SimpleImputer(), ["numerical_missing"]),
        ("encoder", OneHotEncoder(), ["categorical"])
    ], remainder="passthrough")),
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), [0])
    ]))
]).fit_transform(df)
I tend to prefer using column names, so the single-column-transformer-with-separate-pipelines approach may still be preferable, but this isn't such a bad solution either.
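If you do want to keep referring to columns by name in the second step, another possibility is to have the first ColumnTransformer emit a DataFrame, so its output columns can be selected by their prefixed names. A minimal sketch, assuming scikit-learn >= 1.2 (for the set_output API and OneHotEncoder's sparse_output parameter):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Sketch assuming scikit-learn >= 1.2; the pandas output container
# requires dense output, hence sparse_output=False on the encoder.
preprocess = ColumnTransformer([
    ("imputer", SimpleImputer(), ["numerical_missing"]),
    ("encoder", OneHotEncoder(sparse_output=False), ["categorical"]),
], remainder="passthrough").set_output(transform="pandas")

Pipeline([
    ("transformer", preprocess),
    # ColumnTransformer prefixes its output columns with the transformer name,
    # so the imputed column arrives here as "imputer__numerical_missing".
    ("discretizer", ColumnTransformer([
        ("discretizer", KBinsDiscretizer(), ["imputer__numerical_missing"])
    ]))
]).fit_transform(df)

(The prefixes come from ColumnTransformer's verbose_feature_names_out option, which defaults to True.)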
OK, I suppose I might as well also include the approach I keep alluding to:
num_mis_pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("discretizer", KBinsDiscretizer()),
])

ColumnTransformer([
    ("imp_disc", num_mis_pipe, ["numerical_missing"]),
    ("encoder", OneHotEncoder(), ["categorical"]),
], remainder="passthrough").fit_transform(df)
Answered By - Ben Reiniger