Issue
I've been struggling with this for a while, and I'm not sure if it's a bug or if I'm misunderstanding something. I have a simple sklearn (1.3.1) pipeline whose first step renames its input features, so I implemented feature_names_out as below.
If I fit the pipeline on a numpy array, everything is fine and it reports the output feature names as x0__log, x1__log:
from typing import List

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline

def with_suffix(_, names: List[str]):
    return [name + '__log' for name in names]

p = make_pipeline(
    FunctionTransformer(np.log1p, feature_names_out=with_suffix),
    StandardScaler()
)

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
p.fit_transform(df.values)
p.get_feature_names_out()
However, if I fit on the dataframe directly with p.fit_transform(df), then p.get_feature_names_out() gives me a stack trace ending with ValueError: input_features is not equal to feature_names_in_. We can see that the first step outputs the expected names a__log, b__log, but the second step uses the original column names, which causes the conflict:
print(p.feature_names_in_)
print(p[0].get_feature_names_out())
print(p[1].feature_names_in_)
['a' 'b']
['a__log' 'b__log']
['a' 'b']
Eventually I realized I could fix it by applying the same renaming to the dataframe columns themselves. For example, with this wrapper both p.fit_transform(df) and p.fit_transform(df.values) work as expected.
Is this intended? It seems wrong to have to rename the features twice.
def wrapped(func):
    def _func(X):
        W = func(X)
        if isinstance(X, pd.DataFrame):
            W = W.rename(columns=dict(zip(X.columns, with_suffix(None, X.columns))))
        return W
    return _func

p = make_pipeline(
    FunctionTransformer(wrapped(np.log1p), feature_names_out=with_suffix),
    StandardScaler()
)
p.fit_transform(df)
p.get_feature_names_out()
Solution
The problem is that FunctionTransformer by default applies func directly to the input without converting it first. So p[0].transform(df) produces a dataframe whose columns are still [a, b], and p[1] gets fitted on that frame, setting its feature_names_in_ attribute also to [a, b], which contradicts what comes out of get_feature_names_out (having been passed through your with_suffix).
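You can see this mechanism in isolation, without any pipeline at all (a quick sketch): applying np.log1p to a DataFrame returns a DataFrame with the column labels untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])

# FunctionTransformer calls func on the input as-is; pandas preserves the
# column labels when a numpy ufunc is applied, so the renaming declared in
# feature_names_out never reaches the data itself.
print(np.log1p(df).columns.tolist())  # ['a', 'b']
```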
Probably this should be opened as an Issue on the scikit-learn github.
In the meantime, you can set validate=True in your FunctionTransformer: this converts the input to a numpy array, so the subsequent step won't be fitted on a dataframe and so won't have feature_names_in_ set.
Answered By - Ben Reiniger