Issue
I am unable to use remainder='passthrough' when I use StandardScaler and OneHotEncoder at the same time. Whichever way I arrange it, I get a problem: either a "positional argument follows keyword argument" error, a problem with fit_transform... you name it. Here is what I am doing:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough')
trans_cols.fit_transform(X)
Here are my columns:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'],
dtype='object')
The code above works; I am just not able to combine the two estimators when using the remainder keyword argument. Here is what I am trying:
trans_cols = make_column_transformer(
    (OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing',
                       'loan', 'contact', 'month', 'poutcome']),
    remainder='passthrough',
    (StandardScaler(), ['age', 'job', 'marital', 'education', 'default', 'balance',
                        'housing', 'loan', 'contact', 'month', 'duration',
                        'campaign', 'pdays', 'previous', 'poutcome']))
However, the above does not work unless I remove remainder and keep the two tuples, which is understandable. However, doing that, it tries to encode some of my numeric columns and I get a message telling me that it encountered columns containing floats. Plus, my accuracy drops severely.
Solution
The preferred practice is not to apply StandardScaler to one-hot-encoded columns. Note also that in the question's second snippet the keyword argument remainder='passthrough' is placed before a positional (transformer, columns) tuple, which is a Python syntax error; keyword arguments must come after all positional arguments. The first example below demonstrates applying OneHotEncoder (OHE) to the categorical variables and StandardScaler to the numeric columns. The second example shows the sequential application of OHE to selected columns followed by StandardScaler on all columns, but this is not recommended. A sketch adapted to the question's own columns follows after Example 2.
Example 1:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

# Toy data: one categorical and one numeric column
df = pd.DataFrame({'Cat_Var': np.random.choice(['a', 'b'], size=5),
                   'Num_Var': np.arange(5)})

cat_cols = ['Cat_Var']
num_cols = ['Num_Var']

# One-hot encode the categorical columns and scale all remaining columns
col_transformer = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    remainder=StandardScaler())

X = col_transformer.fit_transform(df)
Output:
df
Out[57]:
Cat_Var Num_Var
0 b 0
1 a 1
2 b 2
3 a 3
4 a 4
X
Out[58]:
array([[ 0. , 1. , -1.41421356],
[ 1. , 0. , -0.70710678],
[ 0. , 1. , 0. ],
[ 1. , 0. , 0.70710678],
[ 1. , 0. , 1.41421356]])
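The same result should also be obtainable by listing the numeric columns explicitly instead of relying on remainder. Here is a minimal sketch reusing df, cat_cols and num_cols from above; the names col_transformer_explicit and X_explicit are just illustrative:
# Equivalent to Example 1, with the numeric columns listed explicitly
col_transformer_explicit = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (StandardScaler(), num_cols))

X_explicit = col_transformer_explicit.fit_transform(df)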
Example 2:
col_transformer_2 = ColumnTransformer(
    [('cat_transform', OneHotEncoder(), cat_cols)],
    remainder='passthrough')

# Not recommended: the StandardScaler at the end rescales
# the one-hot-encoded columns as well
pipe = Pipeline([
    ('col_transform', col_transformer_2),
    ('standard_scaler', StandardScaler())
])

X_2 = pipe.fit_transform(df)
Output:
X_2
Out[62]:
array([[-1.22474487, 1.22474487, -1.41421356],
[ 0.81649658, -0.81649658, -0.70710678],
[-1.22474487, 1.22474487, 0. ],
[ 0.81649658, -0.81649658, 0.70710678],
[ 0.81649658, -0.81649658, 1.41421356]])
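Finally, adapted to the columns from the question, a minimal sketch of the first (recommended) approach could look as follows. It assumes X is the questioner's feature DataFrame with the columns listed in the question and that 'age', 'balance', 'duration', 'campaign', 'pdays' and 'previous' are numeric; note that the keyword argument remainder comes after all the (transformer, columns) tuples, which avoids the syntax error in the question:
cat_cols = ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'poutcome']
num_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

trans_cols = make_column_transformer(
    (OneHotEncoder(), cat_cols),
    (StandardScaler(), num_cols),
    remainder='passthrough')  # keyword argument last; remaining columns pass through unchanged

X_trans = trans_cols.fit_transform(X)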
Answered By - KRKirov