Issue
I am unable to use remainder='passthrough' when I use StandardScaler and OneHotEncoder at the same time. Whichever way I arrange it, I get a problem: either a "positional argument follows keyword argument" error, a problem with fit_transform... you name it. Here is what I am doing:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough')
trans_cols.fit_transform(X)
Here are my columns:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'],
dtype='object')
The code above works; I am just not able to combine the two estimators when using the remainder keyword argument. Here is what I am trying:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough',
    (StandardScaler(), ['age', 'job', 'marital', 'education', 'default', 'balance',
                        'housing', 'loan', 'contact', 'month', 'duration',
                        'campaign', 'pdays', 'previous', 'poutcome']))
However, the above does not work unless I remove remainder and keep the two tuples, which is understandable. However, doing that, it tries to encode some of my numeric columns and I get a message telling me that it encountered columns containing floats. Plus, my accuracy drops severely.
Solution
The preferred practice is not to apply StandardScaler to one-hot-encoded columns. Note also that in the question's second snippet the keyword argument remainder='passthrough' is placed before a positional (transformer, columns) tuple, which is a Python syntax error; keyword arguments must come after all positional arguments. The first example below demonstrates applying OneHotEncoder (OHE) to the categorical variables and StandardScaler to the numeric columns. The second example shows the sequential application of OHE to selected columns followed by StandardScaler on all columns, but this is not recommended. A sketch adapted to the question's own columns follows after Example 2.
Example 1:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

# Toy data: one categorical and one numeric column
df = pd.DataFrame({'Cat_Var': np.random.choice(['a', 'b'], size=5),
                   'Num_Var': np.arange(5)})

cat_cols = ['Cat_Var']
num_cols = ['Num_Var']

# One-hot encode the categorical columns and scale all remaining columns
col_transformer = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    remainder=StandardScaler())

X = col_transformer.fit_transform(df)
Output:
df
Out[57]:
Cat_Var Num_Var
0 b 0
1 a 1
2 b 2
3 a 3
4 a 4
X
Out[58]:
array([[ 0. , 1. , -1.41421356],
[ 1. , 0. , -0.70710678],
[ 0. , 1. , 0. ],
[ 1. , 0. , 0.70710678],
[ 1. , 0. , 1.41421356]])
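The same result should also be obtainable by listing the numeric columns explicitly instead of relying on remainder. Here is a minimal sketch reusing df, cat_cols and num_cols from above; the names col_transformer_explicit and X_explicit are just illustrative:
# Equivalent to Example 1, with the numeric columns listed explicitly
col_transformer_explicit = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (StandardScaler(), num_cols))

X_explicit = col_transformer_explicit.fit_transform(df)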
Example 2:
col_transformer_2 = ColumnTransformer(
    [('cat_transform', OneHotEncoder(), cat_cols)],
    remainder='passthrough')

# Not recommended: the StandardScaler at the end rescales
# the one-hot-encoded columns as well
pipe = Pipeline([
    ('col_transform', col_transformer_2),
    ('standard_scaler', StandardScaler())
])

X_2 = pipe.fit_transform(df)
Output:
X_2
Out[62]:
array([[-1.22474487, 1.22474487, -1.41421356],
[ 0.81649658, -0.81649658, -0.70710678],
[-1.22474487, 1.22474487, 0. ],
[ 0.81649658, -0.81649658, 0.70710678],
[ 0.81649658, -0.81649658, 1.41421356]])
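Finally, adapted to the columns from the question, a minimal sketch of the first (recommended) approach could look as follows. It assumes X is the questioner's feature DataFrame with the columns listed in the question and that 'age', 'balance', 'duration', 'campaign', 'pdays' and 'previous' are numeric; note that the keyword argument remainder comes after all the (transformer, columns) tuples, which avoids the syntax error in the question:
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'poutcome']
num_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

trans_cols = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (StandardScaler(), num_cols),
    remainder='passthrough')  # keyword argument last; remaining columns pass through unchanged

X_trans = trans_cols.fit_transform(X)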
Answered By - KRKirov