Tuesday, November 1, 2022

[FIXED] ColumnTransformer & Pipeline with OHE - Is the OHE encoded field retained or removed after ct is performed?

November 01, 2022 pipeline, python, scikit-learn No comments

Issue

Doc on CT:

remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’

By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform. Note that using this feature requires that the DataFrame columns input at fit and transform have identical order.

I believe this remainder= is not relevant to the field being OneHot Encoded. I would like to know how is the OHE field (eg. 'CatX') being handled.

When I do a standalone CT transform, I see that 'CatX' does not appear in the output.

ct = ColumnTransformer(transformers=[('OHE',ohe,ohe_col)],remainder='passthrough')

When I do a standalone CT with OHE repeated, it is successful (ie OHE 2 times). This tells me that within CT, the field is still available but only removed on exiting CT.

ct = ColumnTransformer(transformers=[('OHE',ohe,ohe_col),('OHE1',ohe,ohe_col)],
remainder='passthrough')

Then I tried putting this in a Pipeline, I tried doing CT twice. This is the confusing part. It passed. This tells me that the first CT1 passed 'CatX' to CT2.

ct = ColumnTransformer(transformers=[('OHE',ohe,ohe_col)],remainder='passthrough')    
Pipeline([('ct1',ct),('ct2',ct)('model',v)])

Question:

When using Pipeline, who is controlling whether CT would pass 'CatX' on exit ?
When using Pipeline, if the 'CatX' is being passed, then won't the model be able to process it ?

I hope my question is clear. Thanks for any answers in advance.

Solution

ColumnTransformer is meant to transform columns of arrays or dataframes by means of the specified transformation. This means that after fitting you won't have the original column anymore.

import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
              'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath'],
              'expert_rating': [5, 3, 4, 5],
              'user_rating': [4, 5, 4, 3]})

ct = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(), make_column_selector(pattern='city'))], 
    remainder='passthrough', verbose_feature_names_out=False)

pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names_out())

At the same time you should be aware of the fact that ColumnTransformer applies its transformers in parallel and that's the reason why in your second example you'll see OHE applied twice, namely because both OHE transformers act on the original data. There are interesting posts on this, see here and here and here eg.

Therefore, this would be the result when applying the transformation twice by means of ColumnTransformer:

ct_d = ColumnTransformer(transformers=[
    ('ohe1', OneHotEncoder(), make_column_selector(pattern='city')),
    ('ohe2', OneHotEncoder(), make_column_selector(pattern='city'))], 
    remainder='passthrough', verbose_feature_names_out=True)

pd.DataFrame(ct_d.fit_transform(X), columns=ct_d.get_feature_names_out())

Then, I'm not sure I properly get your issue when using a Pipeline (perhaps it might be useful to add some details or I might be overlooking at something). That's what I get when avoiding to pass through an instance of ColumnTransformer; you won't find the original column anymore as OneHotEncoder removes it.

from sklearn.pipeline import Pipeline

pipe_new = Pipeline([('ohe', OneHotEncoder())])
pd.DataFrame(pipe_new.fit_transform(pd.DataFrame(X['city'])).toarray())

On the other hand, here's an example that might resemble your use of the Pipeline. Namely, column city is one-hot-encoded in ct1 and its output (column 0 of its output, actually) undergoes the same destiny when ct2 is applied (and the remaining columns are passed through). More specifically, column 0 of ct1 output (np.array([1.0, 1.0, 0.0, 0.0]).T) is one-hot-encoded to become np.array([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]).T (actually it wouldn't be a np array, I'm writing it like this only for ease of notation).

ct = ColumnTransformer(transformers=[
    ('ohe', OneHotEncoder(), [0])], remainder='passthrough', verbose_feature_names_out=False)

pipe = Pipeline([
    ('ct1', ct),
    ('ct2', ct)
])

pd.DataFrame(pipe.fit_transform(X))

Answered By - amiola

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, November 1, 2022

[FIXED] ColumnTransformer & Pipeline with OHE - Is the OHE encoded field retained or removed after ct is performed?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels