Issue
I'm using sklearn.pipeline to transform my features and fit a model, so my general flow looks like this: column transformer --> general pipeline --> model. I would like to be able to extract feature names from the column transformer (since the following step, the general pipeline, applies the same transformation to all columns, e.g. nan_to_zero) and use them for model explainability (e.g. feature importance). I'd also like it to work with custom transformer classes.
Here is the setup:
import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
    (preprocessing.StandardScaler(), ["a", "b"]),
    (preprocessing.KBinsDiscretizer(n_bins=2, encode="ordinal"), ["a"]),
    (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
    ("transform", column_transformer),
    ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False)),
])
pipe.fit_transform(df) # returns a numpy array
So far I've tried using get_feature_names_out, e.g.:
pipe.named_steps["transform"].get_feature_names_out()
But I'm running into get_feature_names_out() takes 1 positional argument but 2 were given, and I'm not sure what's going on. This entire process doesn't feel right. Is there a better way to do it?
EDIT: A big thank you to @amiola for answering the question; that was indeed the problem. I just wanted to add another important point for posterity: I was having other problems with my own custom pipeline and was getting the same error, get_feature_names_out() takes 1 positional argument but 2 were given. It turns out that, aside from the KBinsDiscretizer, there was another bug in my custom transformer classes. I had implemented the get_feature_names_out method, but it did not accept any parameters on my end, and that was the problem. If you run into similar issues, make sure that this method has the following signature: get_feature_names_out(self, input_features) -> List[str].
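For illustration, here is a minimal sketch of a custom transformer with a compatible signature. The class itself and its pass-through logic are placeholders, not part of the original code; the key point is that get_feature_names_out accepts the input_features argument that scikit-learn passes in:
from typing import List

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    """Placeholder transformer that passes features through unchanged."""

    def fit(self, X, y=None):
        # Store the incoming column names so they can be echoed back later.
        self.feature_names_in_ = np.asarray(X.columns, dtype=object)
        return self

    def transform(self, X):
        return X

    def get_feature_names_out(self, input_features=None) -> List[str]:
        # Accepting `input_features` (scikit-learn passes the incoming
        # feature names positionally) is what avoids the
        # "takes 1 positional argument but 2 were given" error.
        if input_features is None:
            input_features = self.feature_names_in_
        return list(input_features)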
Solution
It seems the problem is generated by the encode="ordinal" parameter passed to the KBinsDiscretizer constructor. The bug is tracked in GitHub issue #22731 and GitHub issue #22841 and solved with PR #22735.
Indeed, you can see that by specifying encode="onehot" you get a consistent result:
import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
    (preprocessing.StandardScaler(), ["a", "b"]),
    (preprocessing.KBinsDiscretizer(n_bins=2, encode="onehot"), ["a"]),
    (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
    ("transform", column_transformer),
    ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False)),
])
pipe.fit_transform(df)
pipe.named_steps['transform'].get_feature_names_out()
# array(['standardscaler__a', 'standardscaler__b', 'kbinsdiscretizer__a_0.0', 'kbinsdiscretizer__a_1.0', 'onehotencoder__c_x', 'onehotencoder__c_y', 'onehotencoder__c_z'], dtype=object)
Besides this, everything seems fine to me.
However, even after installing the nightly builds, I apparently still get the same error with encode="ordinal".
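As a side note on the original motivation (feature importance), here is a minimal sketch of pairing the extracted feature names with a fitted model. The RandomForestClassifier and the placeholder target y are assumptions for illustration, not part of the original setup:
from sklearn.ensemble import RandomForestClassifier

# Placeholder target, purely for illustration.
y = [0, 1, 0]

X_transformed = pipe.fit_transform(df)
feature_names = pipe.named_steps["transform"].get_feature_names_out()

model = RandomForestClassifier(random_state=0).fit(X_transformed, y)
# Pair each transformed feature name with its importance score.
for name, importance in zip(feature_names, model.feature_importances_):
    print(name, importance)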
Answered By - amiola