Issue
I'm attempting to use scikit-learn's ColumnTransformer class as both an actual DataFrame transformer and as a "monitoring" transformer, i.e., an object that flags when new classes appear in the categorical features of my dataset.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Original DataFrame off of which transformers are fit
orig_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'a'],
        'b': ([np.nan] * 3) + ['a', 'a'],
        'c': np.random.randn(5)
    }
)
# New DataFrame that will be transformed using already fitted transformer
new_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'c'],
        'b': ([np.nan] * 4) + ['b'],
        'c': np.random.randn(5)
    }
)
# Cast NaNs to str to play nicely with OneHotEncoder
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)
# Create master transformer for each of the three columns a, b, and c
transformer_config = [
    ('a', OneHotEncoder(sparse=False, handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(sparse=False, handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
]
transformer = ColumnTransformer(transformer_config)
# Fit to original dataset
transformer.fit(orig_df)
# Transform new dataset
transformer.transform(new_df)
Which produces:
  File "<stdin>", line 2, in <module>
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 495, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/pipeline.py", line 605, in _transform_one
    res = transformer.transform(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 591, in transform
    return self._transform_new(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 553, in _transform_new
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 109, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['c'] in column 0 during transform
This produces the error I generally want, but only for one column. As you can see in new_df, column b has a new level too ('b'). Is there a straightforward way of reporting back all new levels for all fields that use this OneHotEncoder class, instead of just the first one that errs out?
My first thought was to iterate through each field individually, wrapping each transform in a try/except for ValueError, but that doesn't play nicely with ColumnTransformer:
>>> transformer.transform(new_df[['b']])
KeyError: "None of [['a']] are in the [columns]"
Solution
Just a suggested solution for your example:
from sklearn.base import BaseEstimator

for _, t_inst, t_col in transformer.transformers_:
    try:
        # Only real estimators can transform; skip string placeholders such as 'passthrough'
        if isinstance(t_inst, BaseEstimator):
            t_inst.transform(new_df[t_col])
    except Exception as e:
        print('During transformation of column {} the following error occurred: {}'.format(t_col, e))
Output
During transformation of column ['a'] the following error occurred: Found unknown categories ['c'] in column 0 during transform
During transformation of column ['b'] the following error occurred: Found unknown categories ['b'] in column 0 during transform
It simply tries to apply each fitted transformation one by one, so an error in one column does not stop the others from being checked.
Note that the .transformers_ attribute is only available after fitting.
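If you would rather get the new levels back as data instead of parsing error messages, a variation on the same loop is to compare each column's values against the encoder's fitted categories_ attribute. This is only a minimal sketch: the helper name find_unknown_categories is made up for illustration, the sample frames use the string 'nan' directly in place of the astype(str) cast, and the sparse argument is left at its default because its name differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def find_unknown_categories(fitted_transformer, df):
    """Hypothetical helper: map each one-hot-encoded column to its unseen values."""
    unknown = {}
    for _, t_inst, t_cols in fitted_transformer.transformers_:
        # Skip 'passthrough', 'drop', and any estimator that is not a OneHotEncoder
        if not isinstance(t_inst, OneHotEncoder):
            continue
        # categories_ holds one array of known levels per encoded column
        for col, known in zip(t_cols, t_inst.categories_):
            unseen = set(df[col].unique()) - set(known)
            if unseen:
                unknown[col] = sorted(unseen)
    return unknown


# Same data as in the question, with NaNs already cast to the string 'nan'
orig_df = pd.DataFrame({
    'a': ['nan', 'a', 'b', 'b', 'a'],
    'b': ['nan', 'nan', 'nan', 'a', 'a'],
    'c': np.random.randn(5),
})
new_df = pd.DataFrame({
    'a': ['nan', 'a', 'b', 'b', 'c'],
    'b': ['nan', 'nan', 'nan', 'nan', 'b'],
    'c': np.random.randn(5),
})

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

report = find_unknown_categories(transformer, new_df)
print(report)  # {'a': ['c'], 'b': ['b']}
```

Because this never calls transform, it reports every new level in every encoded column in one pass, and the result is a plain dict you can log or alert on.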
Answered By - Jan K