Issue
I am trying to use pipelines and column transformers from sklearn properly, but I always end up with an error. The following example reproduces it.
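import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Data to reproduce the error
X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])
# SimpleImputer to replace the '?' values with the mode
impute = SimpleImputer(missing_values='?', strategy='most_frequent')
# Simple one-hot encoder
# (on scikit-learn >= 1.2 the parameter is named sparse_output; sparse was removed in 1.4)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
col_transfo = ColumnTransformer(transformers=[
    ('missing_vals', impute, ['B', 'D']),
    ('one_hot', ohe, ['A', 'B'])],
    remainder='passthrough'
)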
Then calling the transformer as follows:
col_transfo.fit_transform(X)
Returns the following error:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
Solution
ColumnTransformer applies its transformers in parallel, not in sequence. So the OneHotEncoder sees the un-imputed column B and balks at the mixed types.
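To see the parallel behaviour directly, here is a minimal sketch that routes column B to two branches of one ColumnTransformer. The transformer names `imputed_b` and `raw_b` are made up for this illustration. If the branches ran in sequence, the second copy of B would already be imputed; instead it still contains the '?'.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Hypothetical demo: both branches receive the ORIGINAL column B.
demo = ColumnTransformer(transformers=[
    ('imputed_b', impute, ['B']),       # first output column: B after imputation
    ('raw_b', 'passthrough', ['B']),    # second output column: B untouched
])

both = demo.fit_transform(X)
print(both)  # the second column still holds '?' in the middle row
```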
In your case, it's probably fine to just impute on all the columns, and then encode A, B:
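# SimpleImputer returns a NumPy array, so the column names are gone by the
# time the encoder runs; select A and B by position instead of by name.
encoder = ColumnTransformer(transformers=[
    ('one_hot', ohe, [0, 1])],
    remainder='passthrough'
)
preproc = Pipeline(steps=[
    ('impute', impute),
    ('encode', encoder),
    # optionally, just throw the model here...
])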
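Put together with the data from the question, the impute-then-encode pipeline can be run end to end. This is a sketch under two assumptions that are not in the original post. First, the encoder selects A and B by position (0 and 1), because SimpleImputer returns a plain NumPy array by default and string column names only work on DataFrames. Second, the try/except around OneHotEncoder is only there to cope with the `sparse` to `sparse_output` rename across scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Assumption: handle both parameter names so the sketch runs on old and new versions.
try:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

# The imputer keeps the column order A, B, C, D, so A and B sit at positions 0 and 1.
encoder = ColumnTransformer(
    transformers=[('one_hot', ohe, [0, 1])],
    remainder='passthrough',
)

preproc = Pipeline(steps=[
    ('impute', impute),   # runs first, on every column
    ('encode', encoder),  # then sees only imputed values
])

result = preproc.fit_transform(X)
print(result.shape)
```

The output holds the one-hot columns for A and B first, followed by the passed-through C and D.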
If it's important that future missing values in A, C cause errors, then similarly wrap impute into its own ColumnTransformer.
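A sketch of that variant follows, with the same caveats: the positional indices and the try/except for the OneHotEncoder parameter name are assumptions, not part of the original answer. One detail to watch is that a ColumnTransformer puts its transformed columns first and the passthrough remainder after them, so after the imputing step the column order is B, D, A, C and the encoder has to pick A and B at positions 2 and 0.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Assumption: handle both parameter names so the sketch runs on old and new versions.
try:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

# Impute only B and D; A and C pass through untouched.
imputer = ColumnTransformer(
    transformers=[('missing_vals', impute, ['B', 'D'])],
    remainder='passthrough',
)

# Output order of `imputer` is B, D, A, C, so A is at index 2 and B at index 0.
encoder = ColumnTransformer(
    transformers=[('one_hot', ohe, [2, 0])],
    remainder='passthrough',
)

preproc = Pipeline(steps=[
    ('impute', imputer),
    ('encode', encoder),
])

result = preproc.fit_transform(X)
print(result.shape)
```

Here the output holds the one-hot columns for A and B, followed by the remaining columns in the order D, C.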
See also Apply multiple preprocessing steps to a column in sklearn pipeline
Answered By - Ben Reiniger