Thursday, September 29, 2022

[FIXED] Passing categorical data to Sklearn Decision Tree

September 29, 2022 decision-tree, python, scikit-learn No comments

Issue

There are several posts about how to encode categorical data to Sklearn Decision trees, but from Sklearn documentation, we got these

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

import pandas as pd 
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data, with Sklearn, is it possible?

Solution

Contrary to the accepted answer, I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.

Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.

Refer to the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])

This automatically encodes them into numbers for your machine learning algorithms. Now this also supports going back to strings from integers. You can do this by simply calling inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This would return ['tokyo', 'tokyo', 'paris'].

Also note that for many other classifiers, apart from decision trees, such as logistic regression or SVM, you would like to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.

Hope this helps!

Answered By - Abhinav Arora

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, September 29, 2022

[FIXED] Passing categorical data to Sklearn Decision Tree

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels