Issue
I was trying sklearn pipeline for the first time and using Titanic dataset. I want to first impute missing value in Embarked
and then do one hot encoding. While in Sex
attribute, I just want to do one hot encoding. So, I have the below steps in which two steps are for Embarked
. But it is not working as expected as the Embarked
column remains in addition to its one hot encoding as shown in the output(column having 'S').
If I do imputation and one hot encoding for Embarked
in single step, it is working as expected.
What is the reason behind this or I am doing something wrong? Also, I didn't find any information related to this.
categorical_cols_impute = ['Embarked']
categorical_impute = Pipeline([
("mode_impute", SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='S')),
# ("one_hot", OneHotEncoder(sparse=False))
])
categorical_cols = ['Embarked', 'Sex']
categorical_one_hot = Pipeline([
("one_hot", OneHotEncoder(sparse=False))
])
preprocesor = ColumnTransformer([
("cat_impute", categorical_impute, categorical_cols_impute),
("cat_one_hot", categorical_one_hot, categorical_cols)
], remainder="passthrough")
pipe = Pipeline([
("preprocessor", preprocesor),
# ("model", RandomForestClassifier(random_state=0))
])
Solution
ColumnTransformer
transformers are applied in parallel, not sequentially. So in your example, Embarked
ends up in your transformed data twice: once from the first transformer, keeping its string type, and again from the second transformer, this time one-hot encoded (but not imputed first!(?)).
So just uncomment the second step in the embarked pipeline, and remove Embarked
from categorical_cols
.
See also Consistent ColumnTransformer for intersecting lists of columns (but I don't think it's quite a duplicate).
Answered By - Ben Reiniger
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.