Issue
I am fitting a scikit-learn LabelEncoder on a column in a pandas DataFrame.
How is the order in which the encountered strings are mapped to integers determined? Is it deterministic?
More importantly, can I specify this order?
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(list(le.classes_))
# this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 2 3 1]
I would expect le.classes_ to be ["first", "second", "third", "fourth"] and then encoded to be [0 1 2 3], since this is the order in which the strings appear in the column. Can this be done?
Solution
It's done in sorted order. For strings, that means alphabetical (lexicographic) order. I could not find documentation for this, but looking at the source code for LabelEncoder.transform, we can see that the work is mostly delegated to the function numpy.setdiff1d, which has the following documentation:
Find the set difference of two arrays.
Return the sorted, unique values in ar1 that are not in ar2.
(Emphasis mine, on "sorted".)
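The sorting behavior quoted above can be checked directly in NumPy. The sketch below is only an illustration, not scikit-learn's actual code: numpy.unique and numpy.searchsorted are my own stand-ins for how a sorted list of classes could be turned into integer codes, and the variable names are hypothetical.

```python
import numpy as np

labels = np.array(["first", "second", "third", "fourth"])

# np.setdiff1d returns the sorted, unique values of its first
# argument that are not present in the second argument.
diff = np.setdiff1d(labels, np.array(["third"]))
print(diff.tolist())  # ['first', 'fourth', 'second']

# np.unique also returns distinct values in sorted order; for
# strings that is lexicographic order, not order of appearance.
classes = np.unique(labels)
print(classes.tolist())  # ['first', 'fourth', 'second', 'third']

# Assumption for illustration: the integer code of a label is its
# position in the sorted array of classes.
codes = np.searchsorted(classes, labels)
print(codes.tolist())  # [0, 2, 3, 1]
```

The codes match the [0 2 3 1] output from the question, which is consistent with the encoder assigning integers by position in the sorted classes.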
Note that since this is not documented, it is probably implementation-defined and may change between versions. It could be that only the version I looked at uses sorted order, and other versions of scikit-learn may behave differently (for example, by not using numpy.setdiff1d).
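As for specifying the order yourself: one option that does not depend on LabelEncoder internals is to build the mapping by hand. The following is a minimal sketch in plain Python, assuming that order of first appearance is what you want; the names labels, classes, mapping and encoded are hypothetical, and this is not a LabelEncoder feature.

```python
labels = ["first", "second", "third", "fourth", "second"]

# dict.fromkeys keeps keys in insertion order (guaranteed since
# Python 3.7) and drops repeated labels.
classes = list(dict.fromkeys(labels))
print(classes)  # ['first', 'second', 'third', 'fourth']

# Map each class to its position in the order of first appearance.
mapping = {label: code for code, label in enumerate(classes)}

# Encode the original labels using that mapping.
encoded = [mapping[label] for label in labels]
print(encoded)  # [0, 1, 2, 3, 1]
```

To impose a different order, replace classes with your own list. If the data already lives in pandas, pandas.factorize also assigns codes in order of appearance by default.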
Answered By - Mephy