Issue
I've been struggling with this for a while, and I'm not sure if it's a bug or if I'm misunderstanding something. I have a simple sklearn (1.3.1) pipeline whose first step renames its input features, so I implemented feature_names_out as below.
If I fit the pipeline on a numpy array, everything is fine and it reports the output feature names as x0__log, x1__log:
from typing import List

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.pipeline import make_pipeline

def with_suffix(_, names: List[str]):
    return [name + '__log' for name in names]

p = make_pipeline(
    FunctionTransformer(np.log1p, feature_names_out=with_suffix),
    StandardScaler()
)

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
p.fit_transform(df.values)
p.get_feature_names_out()
However, if I fit on the dataframe directly with p.fit_transform(df), then p.get_feature_names_out() gives me a stack trace ending with ValueError: input_features is not equal to feature_names_in_. We can see that the first step outputs the expected names a__log, b__log, but the second step uses the original column names, which causes the conflict:
print(p.feature_names_in_)
print(p[0].get_feature_names_out())
print(p[1].feature_names_in_)
['a' 'b']
['a__log' 'b__log']
['a' 'b']
Eventually I realized I could fix it by applying the same renaming to the dataframe columns themselves. For example, with this wrapper both p.fit_transform(df) and p.fit_transform(df.values) work as expected.
Is this intended? It seems wrong to have to rename the features twice.
def wrapped(func):
    def _func(X):
        W = func(X)
        if isinstance(X, pd.DataFrame):
            W = W.rename(columns=dict(zip(X.columns, with_suffix(None, X.columns))))
        return W
    return _func

p = make_pipeline(
    FunctionTransformer(wrapped(np.log1p), feature_names_out=with_suffix),
    StandardScaler()
)
p.fit_transform(df)
p.get_feature_names_out()
Solution
The problem is that FunctionTransformer by default applies func directly to the input without converting it first. So p[0].transform(df) produces a dataframe whose columns are still [a, b], and p[1] gets fitted on that frame, setting its feature_names_in_ attribute also to [a, b], which contradicts what comes out of get_feature_names_out (having been passed through your with_suffix).
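You can see this mechanism in isolation, without any pipeline at all (a quick sketch): applying np.log1p to a DataFrame returns a DataFrame with the column labels untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])

# FunctionTransformer calls func on the input as-is; pandas preserves the
# column labels when a numpy ufunc is applied, so the renaming declared in
# feature_names_out never reaches the data itself.
print(np.log1p(df).columns.tolist())  # ['a', 'b']
```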
Probably this should be opened as an Issue on the scikit-learn github.
In the meantime, you can set validate=True in your FunctionTransformer: this converts the input to a numpy array, so the subsequent step won't be fitted on a dataframe and so won't have feature_names_in_ set.
Answered By - Ben Reiniger