Issue
Consider the following code from the scikit-learn website:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)
This will allow me to represent categorical information as binary input. The output of the code:
enc.get_feature_names()
is
array(['x0_Female', 'x0_Male', 'x1_1', 'x1_2', 'x1_3'], dtype=object)
which shows the new features in the transformed space. However, why should it represent female and male separately? The two are mutually exclusive, so a single feature should suffice, for example 0 -> 'female' and 1 -> 'male'. Running the code,
enc.transform([['Female', 1], ['Male', 2]]).toarray()
the output is
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0.]])
Since there are only two possible values for that category, the first two elements of each row will be either 0-1 for male or 1-0 for female, and the correlation between them will be -1. This information can be represented as a single feature, so why does it create two?
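To make the redundancy concrete, here is a small sketch that checks the claim numerically. The two arrays are written out by hand (an assumption for illustration, not encoder output captured from a run): they are the x0_Female and x0_Male columns that one-hot encoding would give for the three fitted rows Male, Female, Female.

```python
import numpy as np

# Hand-written one-hot columns for the fitted rows: Male, Female, Female
x0_female = np.array([0.0, 1.0, 1.0])
x0_male = np.array([1.0, 0.0, 0.0])

# Each row has exactly one of the two indicators set
row_sums = x0_female + x0_male

# Pearson correlation between the two indicator columns
corr = np.corrcoef(x0_female, x0_male)[0, 1]

print(row_sums)
print(round(corr, 6))
```

Because the columns always sum to 1, either one can be recovered from the other, which is why the correlation comes out as -1.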
Solution
OneHotEncoder cannot know what you want and need. In any case, it should not behave differently for a feature with 2 categories than for one with 100.
Imagine you have 5 or 100 categories within a feature. By chance, it might drop category X, which has a very strong correlation with the target. Then your ML algorithm would have a hard time generalizing well (for example, a tree-based algorithm would need splits establishing that all of the remaining 4 or 99 binary columns are equal to 0, which leads to many splits).
But indeed, there is redundant information. At the time of writing, OneHotEncoder did not allow configuring the transformation to drop one of the categories (which could be beneficial for linear models, for example); newer scikit-learn releases add a drop parameter for this (for example drop='first' or drop='if_binary'). If you need that functionality, you can also use pandas.get_dummies instead. It has a drop_first argument, and by default it transforms only categorical features instead of all features.
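As a minimal sketch of the pandas.get_dummies route, assuming the question's data is loaded into a DataFrame (the column names "gender" and "group" are made up for illustration and do not appear in the original):

```python
import pandas as pd

# Same rows as the question's X; column names are hypothetical
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "group": [1, 3, 2],
})

# drop_first=True drops the first category of each encoded column.
# By default only non-numeric (object/category) columns are encoded,
# so the integer "group" column is passed through unchanged.
dummies = pd.get_dummies(df, drop_first=True)

print(list(dummies.columns))
print(dummies)
```

Here "gender" collapses to a single indicator column, gender_Male, which is the single-feature representation the question asks for. Note that, unlike a fitted OneHotEncoder, get_dummies does not remember the categories it saw, so applying it separately to training and test data can produce different columns.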
Answered By - Mischa Lisovyi