Issue
I have a ColumnTransformer like this:
oh = OneHotEncoder(max_categories=10, min_frequency=1000, sparse_output=False,
                   handle_unknown='infrequent_if_exist', drop='if_binary')
sc = StandardScaler()
min_max_scaler = MinMaxScaler()
coltrans_ = ColumnTransformer(
    [("cat_process", oh, cat_cols_),
     ("scal_process", sc, scal_cols_),
     ("min_max_process", min_max_scaler, min_max_columns)]
)
When I fit this on one DataFrame and then transform another, I get the following warning:
UserWarning: Found unknown categories in columns [1, 2, 3, 6, 7] during transform. These unknown categories will be encoded as all zeros
Sadly, it doesn't tell me which encoder raises this warning, but it can only be the OneHotEncoder, since the other ones don't use categorical features.
So I tried to find out which columns make this warning appear:
coltrans_.transformers_
gives (I marked the columns I think trigger the warning):
[('cat_process',
  OneHotEncoder(drop='if_binary', handle_unknown='infrequent_if_exist',
                max_categories=10, min_frequency=1000, sparse_output=False),
  ['categorical_1',
   'categorical_2',  #<--
   'categorical_3',  #<--
   'categorical_4',  #<--
   'categorical_5',
   'categorical_6',
   'categorical_7',  #<--
   'categorical_8'  #<--
  ]),
 ('scal_process',
  StandardScaler(),
  ['other_1',
   'other_2',
   'other_3',
   'other_4',
   'other_5',
   'other_6',
   'other_7',
   'other_8',
   'other_9',
   'other_10',
   'other_11',
   'other_12',
   'other_13',
   'other_14']),
 ('min_max_process', MinMaxScaler(), [])]
I am not sure whether the marked columns are the right ones, because I don't know if the list in the warning is zero-based or one-based.
I thought that with min_frequency=1000, handle_unknown='infrequent_if_exist', drop='if_binary'
on the encoder I should not get a warning. I know there are infrequent categories, and they should get the infrequent label.
Why do I get the warning?
Solution
You get this warning because two things happen at the same time:
- During transform(), the encoder encounters categories that it hasn't seen during fit().
- During fit(), the 'infrequent' column is not created for those features. As a result, the new, unseen categories can't be assigned to 'infrequent'.
A solution is to set handle_unknown='ignore'. This prevents the encoder from attempting to assign unknown categories to a non-existent 'infrequent' column during transform().
Answered By - DataJanitor