Issue
I was searching for machine learning examples to look at and understand and I stumbled upon this example: https://www.kaggle.com/saulalquicira/model-evaluation-using-cross-val-score-and-kfold
I understand everything in the code except for this part:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

labelencoder_X = LabelEncoder()
X[:, 2] = labelencoder_X.fit_transform(X[:, 2])

# One-hot encode each categorical column one at a time. The column indices
# differ at each step because remainder='passthrough' moves the encoded
# columns to the front, shifting the positions of the remaining features.
ct = ColumnTransformer([("cp", OneHotEncoder(), [2])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("restecg", OneHotEncoder(), [9])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("slope", OneHotEncoder(), [15])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("ca", OneHotEncoder(), [18])], remainder='passthrough')
X = ct.fit_transform(X)
ct = ColumnTransformer([("thal", OneHotEncoder(), [22])], remainder='passthrough')
X = ct.fit_transform(X)
I understand what every individual keyword does, but why are we using this on values that are already numerical? I thought we apply this to categorical data that is alphabetical in nature, in order to transform it into numerical binary values that machine learning algorithms can understand. Here is how the dataset looks:
Solution
The features being transformed here are numerical only in representation: they have already been integer- or label-encoded, but the data they represent may still be categorical in nature.
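As a minimal sketch of that point (the category names here are hypothetical, loosely modeled on the heart-disease dataset's "thal" feature): label encoding turns strings into integer codes, so the stored column looks numerical even though nothing categorical about it has changed.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string-valued categorical column.
thal = ["normal", "fixed", "reversible", "normal"]

# LabelEncoder assigns integer codes (alphabetically: fixed=0, normal=1,
# reversible=2). The result is numeric in representation only.
encoded = LabelEncoder().fit_transform(thal)
print(encoded.tolist())  # [1, 0, 2, 1]
```

A dataset distributed with such columns already encoded is indistinguishable at a glance from one with truly numeric features, which is exactly the source of the question's confusion.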
When you are working with ordinal data (categorical, but with a meaningful order to the values, e.g. 1 < 2 < 3), label encoding is sufficient. When you are working with truly nominal values that have no meaningful order, you should still one-hot encode (or use a similar technique) to prevent your algorithm from falsely inferring an order from the integer codes.
Answered By - pciunkiewicz