Issue
I am curious how sklearn handles continuous variables in sklearn.tree.DecisionTreeClassifier. I tried using some continuous variables without any preprocessing with DecisionTreeClassifier, and it still achieved acceptable accuracy.
Below is one way to translate the continuous variables into categorical ones, but it doesn't achieve the same accuracy.
import numpy as np

def preprocess(data, min_d, max_d, bin_size=3):
    # Scale the data to [0, 1], then assign each value to one of `bin_size` bins.
    norm_data = np.clip((data - min_d) / (max_d - min_d), 0, 1)
    categorical_data = np.floor(bin_size * norm_data).astype(int)
    return categorical_data

X = preprocess(X, X.min(), X.max(), 3)
Solution
The decision tree splits continuous values at the point that best distinguishes between the two classes. Say, for example, that a decision tree would split height between men and women at 165 cm, because most people would be correctly classified with this boundary: the algorithm finds that most women are under 165 cm and most men are over 165 cm.
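To make that concrete, here is an illustrative sketch (not scikit-learn's actual internals) of how such an optimal threshold can be found for a single continuous feature by minimizing the weighted Gini impurity of the two sides; the heights and labels are made up for the example.

import numpy as np

def gini(labels):
    # Gini impurity of a set of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    # Scan candidate thresholds (midpoints between consecutive sorted values)
    # and keep the one with the lowest weighted impurity of the two halves.
    order = np.argsort(feature)
    feature, labels = feature[order], labels[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(feature)):
        if feature[i] == feature[i - 1]:
            continue
        t = (feature[i] + feature[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical heights (cm) and labels: 0 = woman, 1 = man
heights = np.array([150, 155, 160, 162, 168, 170, 175, 180], dtype=float)
sex = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(best_threshold(heights, sex))  # prints 165.0 for this toy data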
A decision tree will find the optimal splitting point for every attribute, often reusing the same attribute multiple times. Consider, for example, a decision tree trained on the Iris dataset, which splits directly on the continuous values in its columns.
For instance, you can see X[3] < 0.8, where continuous values under 0.8 in that column are classified as class 0. You can also see how many instances of each class this split applies to: [50, 0, 0].
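A minimal sketch of reproducing such a tree yourself is below; the exact thresholds and feature indices can vary between scikit-learn versions, but on the Iris data the root split is typically on petal width (X[3]) at about 0.8, which isolates all 50 setosa samples (class 0).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Print the learned splits as text; the root is usually "petal width (cm) <= 0.80".
print(export_text(clf, feature_names=iris.feature_names))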
You're probably getting lower accuracy with your arbitrary split points because you're losing information by binning. Returning to the height example: imagine your height data isn't continuous, but instead you only know whether people are above or below 150 cm. You're losing a lot of information. A decision tree also splits the continuous data into regions, but at least it will 1) find the optimal splitting point, and 2) be able to split on the same attribute more than once. So it will perform better than your arbitrary split.
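As a rough check of that claim, the sketch below (using the Iris dataset as a stand-in, since the original data isn't shown) compares a tree trained on the raw continuous features with one trained on the 3-bin preprocessing from the question.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def preprocess(data, min_d, max_d, bin_size=3):
    # The binning approach from the question.
    norm_data = np.clip((data - min_d) / (max_d - min_d), 0, 1)
    return np.floor(bin_size * norm_data).astype(int)

X, y = load_iris(return_X_y=True)
X_binned = preprocess(X, X.min(), X.max(), 3)

for name, features in [("raw", X), ("binned", X_binned)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))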
Answered By - Nicolas Gervais