Issue
My teacher explained that it is important to encode categorical variables after the train-test split to prevent data leakage, and demonstrated it with an example using LabelEncoder (from sklearn). But when I tried to do the same on another dataframe, whose column (datatype: object) contains more than 1000 different labels, I ran into an error:
ValueError: y contains new labels (when using scikit-learn's LabelEncoder)
To get around this I encoded before the train-test split, which is causing overfitting. Is there any way to encode/handle labels unseen in the training data using LabelEncoder? There are a lot of different values in the column, so one-hot encoding (which can handle unknown values) is not feasible. Is there any other alternative that is not too complex (I'm new to ML and Python)?
Solution
LabelEncoder is meant to encode target labels (y) and therefore should not be used to encode X variables. However, scikit-learn's OrdinalEncoder performs the same transformation for X variables, and in addition it provides arguments to handle unknown input. You can use it as follows:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Note: OrdinalEncoder does not accept handle_unknown='ignore'
# (that option belongs to OneHotEncoder); use 'use_encoded_value' instead.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
The OrdinalEncoder will then return NaN for any category it did not see during fitting.
NB: The handle_unknown='use_encoded_value' option is available from scikit-learn version 0.24 onwards.
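To make the behavior concrete, here is a minimal sketch using made-up city names (the data values are illustrative, not from the question): the encoder is fitted only on the training split, and a category that appears only in the test split comes out as NaN instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical splits: "berlin" appears only in the test data.
X_train = np.array([["london"], ["paris"], ["paris"]])
X_test = np.array([["london"], ["berlin"]])

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan)
encoder.fit(X_train)  # fit only on the training split to avoid leakage

print(encoder.transform(X_train))  # known categories become integer codes
print(encoder.transform(X_test))   # the unseen "berlin" becomes NaN
```

The NaN rows can then be dropped or imputed as a separate preprocessing step, depending on how you want to treat unseen categories.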
Answered By - Antoine Dubuis