Issue
I have a pandas.Series (s, below) that I want to one-hot-encode. I've found through research that the 'b' level is not important for my predictive modeling task. I can exclude it from my analysis like so:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
s = pd.Series(['a', 'b', 'c']).values.reshape(-1, 1)
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='error')
enc.fit_transform(s)
# array([[1., 0.],
#        [0., 0.],
#        [0., 1.]])
enc.get_feature_names()
# array(['x0_a', 'x0_c'], dtype=object)
But when I go to transform a new series, one containing both 'b' and a new level, 'd', I get an error:
new_s = pd.Series(['a', 'b', 'c', 'd']).values.reshape(-1, 1)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['d'] in column 0 during transform
This is to be expected since I set handle_unknown='error' above. However, I'd like to completely ignore all classes except for ['a', 'c'] in both the fitting and subsequent transforming steps. I tried this:
enc = OneHotEncoder(drop=['b'], sparse=False, handle_unknown='ignore')
enc.fit_transform(s)
enc.transform(new_s)
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 371, in fit_transform
    self._validate_keywords()
  File "/Users/user/Documents/assets/envs/data-science/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 289, in _validate_keywords
    "`handle_unknown` must be 'error' when the drop parameter is "
ValueError: `handle_unknown` must be 'error' when the drop parameter is specified, as both would create categories that are all zero.
It seems this pattern is not supported in scikit-learn. Does anyone know a scikit-learn-compatible pattern to accomplish this task?
Solution
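You could approach this by subclassing OneHotEncoder and overriding transform so that unknown categories are encoded as all zeros instead of raising an error: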
import numpy as np
from sklearn.preprocessing import OneHotEncoder

class IgnorantOneHotEncoder(OneHotEncoder):
    # Note: this only looks at categories_[0], so it assumes a single feature column
    def transform(self, X, y=None):
        try:
            return super().transform(X)
        except ValueError as e:
            if 'Found unknown categories' in str(e):
                X = np.copy(X)
                # Keep track of indices corresponding to unknown categories
                unknown_categories_mask = ~np.isin(X, self.categories_[0]).ravel()
                # Overwrite the unknown categories in the input matrix, X, with the first known category
                X[unknown_categories_mask] = self.categories_[0][0]
                # Transform X, whose categories are all known now
                X = super().transform(X)
                # Overwrite originally unknown-category records with 0 to indicate
                # absence of any value for any category for that feature
                X[unknown_categories_mask, 0] = 0
                return X
            else:
                raise
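To make the masking steps in the comments above concrete, here is a minimal sketch of the same idea in plain NumPy, outside scikit-learn. The arrays `known` and `X` are hypothetical stand-ins for `self.categories_[0]` and the input, and the broadcasting comparison is only an illustrative substitute for the real encoder, not how scikit-learn does it internally.

```python
import numpy as np

# Hypothetical stand-ins: `known` plays the role of categories_[0],
# `X` is a single-column input containing one unseen level, 'd'.
known = np.array(['a', 'b', 'c'])
X = np.array(['a', 'b', 'c', 'd']).reshape(-1, 1)

# 1. Flag the rows whose value is not among the known categories.
unknown_mask = ~np.isin(X, known).ravel()

# 2. Replace flagged values with a known placeholder so that a strict
#    encoder would accept every row.
X_safe = X.copy()
X_safe[unknown_mask] = known[0]

# 3. One-hot encode (here by broadcasting a comparison against `known`),
#    then zero out the flagged rows so the placeholder leaves no trace.
encoded = (X_safe == known).astype(float)
encoded[unknown_mask, :] = 0

print(unknown_mask)
print(encoded)
```

The subclass follows the same three steps, delegating the encoding in step 3 to OneHotEncoder.transform.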
Try it out:
>>> ienc = IgnorantOneHotEncoder(sparse=False)
>>> ienc.fit(s)
IgnorantOneHotEncoder(sparse=False)
>>> ienc.transform(s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> ienc.transform(new_s)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])
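If the levels worth keeping are known up front, another option that may be worth trying is to list them in the encoder's categories parameter together with handle_unknown='ignore', without using drop at all. The following is a sketch under a few assumptions: a single feature column, only 'a' and 'c' are of interest, the arrays are built with NumPy directly rather than from a pandas.Series, and .toarray() is called because the default output is a sparse matrix.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same data as in the question, built directly as single-column arrays.
s = np.array(['a', 'b', 'c'], dtype=object).reshape(-1, 1)
new_s = np.array(['a', 'b', 'c', 'd'], dtype=object).reshape(-1, 1)

# Declare only 'a' and 'c' as categories; with handle_unknown='ignore',
# any other value is treated as unknown and encoded as all zeros.
enc = OneHotEncoder(categories=[['a', 'c']], handle_unknown='ignore')
enc.fit(s)

# Densify the sparse result for display.
encoded = enc.transform(new_s).toarray()
print(encoded)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]
#  [0. 0.]]
```

Here both 'b' (present during fit but not listed) and 'd' (never seen) end up as all-zero rows, which is the behavior the question asks for.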
Answered By - blacksite