Issue
I was searching for machine learning examples to look at and understand and I stumbled upon this example: https://www.kaggle.com/saulalquicira/model-evaluation-using-cross-val-score-and-kfold
I understand everything in the code except for this part:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])

# One-hot encode each categorical column one at a time. The column indices
# differ at each step because remainder='passthrough' moves the encoded
# columns to the front, shifting the positions of the remaining features.
ct = ColumnTransformer([("cp", OneHotEncoder(), [2])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("restecg", OneHotEncoder(), [9])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("slope", OneHotEncoder(), [15])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("ca", OneHotEncoder(), [18])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("thal", OneHotEncoder(), [22])], remainder='passthrough')
X = ct.fit_transform(X)
I understand what every individual keyword does, but why are we using this on values that are already numerical? I thought we apply this to categorical data that is alphabetical in nature, in order to transform it into numerical binary values that machine learning algorithms can understand. Here is how the dataset looks:
Solution
The features being transformed here are numerical only in representation: they have already been integer- or label-encoded, but the data they represent may still be categorical in nature.
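As a minimal sketch of that point (the category names here are hypothetical, loosely modeled on the heart-disease dataset's "thal" feature): label encoding turns strings into integer codes, so the stored column looks numerical even though nothing categorical about it has changed.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string-valued categorical column.
thal = ["normal", "fixed", "reversible", "normal"]

# LabelEncoder assigns integer codes (alphabetically: fixed=0, normal=1,
# reversible=2). The result is numeric in representation only.
encoded = LabelEncoder().fit_transform(thal)
print(encoded.tolist())  # [1, 0, 2, 1]
```

A dataset distributed with such columns already encoded is indistinguishable at a glance from one with truly numeric features, which is exactly the source of the question's confusion.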
When you are working with ordinal data (categorical, but with a meaningful order to the values, e.g. 1 < 2 < 3), label encoding is sufficient. When you are working with truly nominal values that have no meaningful order, you should still one-hot encode (or use a similar technique) to prevent your algorithm from falsely inferring an order from the integer codes.
Answered By - pciunkiewicz