Issue
Reading the scikit-learn documentation for LabelEncoder:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
it states "encode categorical integer features using a one-hot aka one-of-K scheme." Does this mean it also one-hot encodes a list of words?
The Wikipedia definition of one-hot encoding ( https://en.wikipedia.org/wiki/One-hot ) says:
"In natural language processing, a one-hot vector is a 1 × N matrix (vector) used to distinguish each word in a vocabulary from every other word in the vocabulary. The vector consists of 0s in all cells with the exception of a single 1 in a cell used uniquely to identify the word."
Running the code below, it appears that LabelEncoder is not a correct implementation of one-hot encoding, whereas OneHotEncoder is:
import numpy as np
from numpy import array
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# define example
data = ['w1 w2 w3', 'w1 w2']
values = array(data)

# integer encode: each distinct string gets one integer label
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# binary encode
onehot_encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

# note: MultiLabelBinarizer iterates each string, so here it binarizes
# the individual characters {' ', '1', '2', '3', 'w'}, not the words
mlb = MultiLabelBinarizer()
print('fit_transform\n', mlb.fit_transform(data))
print('\none hot\n', onehot_encoder.fit_transform(integer_encoded))
This prints:
fit_transform
[[1 1 1 1 1]
[1 1 1 0 1]]
one hot
[[0. 1.]
[1. 0.]]
So LabelEncoder does not one-hot encode. What type of encoding does LabelEncoder use, then?
From the outputs above, OneHotEncoder appears to produce a denser vector than LabelEncoder's encoding scheme.
Update:
How do I decide whether to encode data for a machine learning algorithm using LabelEncoder or OneHotEncoder?
Solution
I think your question is not clear enough...
First, LabelEncoder encodes labels with values between 0 and n_classes - 1, while OneHotEncoder encodes categorical integer features using a one-hot (aka one-of-K) scheme. They are different.
Second, yes, OneHotEncoder encodes a list of words. The Wikipedia definition says a one-hot vector is a 1 × N matrix. But what is N? Actually, N is the size of your vocabulary.
For example, suppose you have five words a, b, c, d, e. One-hot encoding them gives:
a -> [1, 0, 0, 0, 0] # a one-hot 1 x 5 vector
b -> [0, 1, 0, 0, 0] # a one-hot 1 x 5 vector
c -> [0, 0, 1, 0, 0] # a one-hot 1 x 5 vector
d -> [0, 0, 0, 1, 0] # a one-hot 1 x 5 vector
e -> [0, 0, 0, 0, 1] # a one-hot 1 x 5 vector
# total five one-hot 1 x 5 vectors which can be expressed in a 5 x 5 matrix.
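The table above can be reproduced with scikit-learn's LabelBinarizer (one of several ways to do it; the variable names here are just for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

vocab = ['a', 'b', 'c', 'd', 'e']

# one row per word, one column per vocabulary entry
onehot = LabelBinarizer().fit_transform(vocab)
print(onehot)
# five one-hot 1 x 5 rows, i.e. the 5 x 5 identity matrix
```
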
Third, actually I'm not 100% sure what you are asking...
Finally, to answer your updated question: most of the time you should choose one-hot encoding or word embeddings. The reason is that the integer codes generated by LabelEncoder are too similar to one another, so there isn't much difference between them. Since similar inputs are more likely to produce similar outputs, this makes your model difficult to fit.
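One way to see the problem, as a small illustration: label-encoded integers impose an artificial ordering and unequal distances between classes, while one-hot vectors keep every pair of distinct classes equally far apart:

```python
import numpy as np

# label-encoded classes 0, 1, 2 make class 0 look
# "closer" to class 1 than to class 2
ints = np.array([0.0, 1.0, 2.0])
print(abs(ints[0] - ints[1]))  # 1.0
print(abs(ints[0] - ints[2]))  # 2.0

# one-hot vectors: every pair of distinct classes is at the
# same Euclidean distance, sqrt(2)
onehot = np.eye(3)
d01 = np.linalg.norm(onehot[0] - onehot[1])
d02 = np.linalg.norm(onehot[0] - onehot[2])
print(d01, d02)  # both ~1.414
```
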
Answered By - Sraw