Issue
I have a dataset with categorical features and my pre-processing is the following:
dataset = data_df.values
#Splitting dataset:
X = dataset[:,0:8]
y = dataset[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
#columns to one-hot-encoding:
categorical_cols = ['emp','ed','st','jb','br']
categorical_cols_idx = [data_df.columns.get_loc(c) for c in categorical_cols]
del data_df
#Train Encoding
ohe = OneHotEncoder(sparse = False)
X_train_enc = ohe.fit_transform(X_train[:,categorical_cols_idx])
X_train_new = np.concatenate((X_train[:,:3],X_train_enc),axis = 1)
del X_train
del X_train_enc
#Test Encoding
X_test_enc = ohe.transform(X_test[:,categorical_cols_idx])
X_test_new = np.concatenate((X_test[:,:3],X_test_enc),axis = 1)
del X_test
del X_test_enc
But I doubt that my pre-processing step is optimal. How should I optimize it?
Solution
You can use make_column_transformer and make_column_selector to perform this operation. make_column_transformer specifies which operation to perform on which columns, and make_column_selector selects columns according to their dtype. In addition, the remainder argument controls what happens to columns that are not handled by any transformer in the ColumnTransformer. Here, remainder='passthrough' simply returns the features that are not of dtype object as-is.
As follows:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
import pandas as pd
df = pd.DataFrame({
    'emp': ['A', 'A', 'B'],
    'ed': ['A', 'A', 'B'],
    'st': ['A', 'A', 'B'],
    'jb': ['A', 'A', 'B'],
    'br': ['A', 'A', 'B'],
    'other_feature_1': [0, 1, 2],
    'other_feature_2': [2, 3, 4]
})
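# One-hot encode every object-dtype column; pass the other columns through unchanged.
# Note: from scikit-learn 1.2 onwards the `sparse` argument is named `sparse_output`.
preprocessing = make_column_transformer(
    (OneHotEncoder(sparse=False), make_column_selector(dtype_include='object')),
    remainder='passthrough'
)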
preprocessing.fit_transform(df)
This will output the following array:
array([[1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 2.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 3.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 2., 4.]])
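To connect this back to the train/test split in the question, here is a minimal sketch of fitting the transformer on the training data only and reusing it on the test data. The toy values, the 'target' column name, and the reduced column set are assumptions made for illustration. It also lists the categorical columns by name instead of using make_column_selector, and it uses sparse_threshold=0 on the column transformer to get a dense array, which sidesteps the sparse / sparse_output naming difference between scikit-learn versions.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values); 'target' stands in for the label column.
df = pd.DataFrame({
    'emp': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'ed': ['A', 'B', 'B', 'A', 'A', 'B', 'B', 'A'],
    'other_feature_1': [0, 1, 2, 3, 4, 5, 6, 7],
    'target': [0, 1, 0, 1, 0, 1, 0, 1],
})

# Split while the data is still a DataFrame, so columns can be addressed by name.
X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Explicit list of columns to one-hot encode (a subset of the question's list).
categorical_cols = ['emp', 'ed']

preprocessing = make_column_transformer(
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros.
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)

# Learn the categories from the training set only, then reuse them on the test set.
X_train_new = preprocessing.fit_transform(X_train)
X_test_new = preprocessing.transform(X_test)
```

Because the fitted transformer is a single object, the same preprocessing can later be applied to new data with one transform call, without tracking column indices or concatenating arrays by hand.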
Answered By - Antoine Dubuis