Tuesday, February 1, 2022

[FIXED] interpreting Graphviz output for decision tree regression

Labels: decision-tree, graphviz, machine-learning, regression, scikit-learn

Issue

I'm curious what the value field means in the nodes of the decision tree rendered by Graphviz when the tree is used for regression. I understand that, for decision tree classification, it is the number of samples in each class that are separated by a split, but I'm not sure what it means for regression.

My data has a 2-dimensional input and a 10-dimensional output. Here is an example of what a tree looks like for my regression problem:

It was produced using this code and visualized with webgraphviz:

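import pickle
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

# X = (n x 2)  Y = (n x 10)  X_test = (m x 2)

with open("../input_scaler.sav", "rb") as f:
    input_scaler = pickle.load(f)  # loaded here but not used in this snippet
# criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0
reg = DecisionTreeRegressor(criterion='squared_error', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)  # returns None when out_file is given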

Solution

What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.

In other words, and using the leftmost terminal node (leaf) of your tree as an example:

The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leaf, which is indeed of length 10; i.e. the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, and so on, up to the mean of Y[9], which is 3211.487.

You can confirm that this is the case by predicting some samples (from your training or test set - it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
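A minimal sketch of this check, assuming synthetic stand-in data (random X and Y of the same shapes as in the question, not the original dataset). It relies on the fitted regressor's tree_.value array and apply() method to look up the leaf each sample lands in:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the data from the question).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(140, 2))
Y = rng.normal(size=(140, 10))
X_test = rng.uniform(0.0, 1.0, size=(20, 2))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

# For a regressor, tree_.value has shape (n_nodes, n_outputs, 1);
# dropping the last axis gives one value list per node.
node_values = reg.tree_.value[:, :, 0]

# apply() returns the index of the leaf that each sample falls into.
leaf_ids = reg.apply(X_test)
pred = reg.predict(X_test)

# Each prediction should coincide with the value list of its leaf.
matches = np.allclose(pred, node_values[leaf_ids])

# With max_depth=2 there are at most 4 leaves, hence at most 4 distinct outputs.
n_distinct = len(np.unique(pred, axis=0))
print(matches, n_distinct)
```

The same comparison should work with the original X_test in place of the synthetic one.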

Additionally, you can confirm that, for each element in value, the sample-weighted average of the two child nodes equals the respective element of the parent node. Again, using the first element of your 2 leftmost terminal nodes (leaves), we get:

(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858

i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:

(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822

which again agrees with the -0.0 first value element of your root node.
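The two calculations above can be wrapped in a small helper. This is only an illustrative sketch: weighted_mean is a hypothetical name, and the numbers are the ones quoted in this answer (sample counts and first value elements of the nodes).

```python
def weighted_mean(values, counts):
    """Sample-weighted mean of per-node values."""
    total = sum(counts)
    return sum(v * n for v, n in zip(values, counts)) / total

# First value element of the two leftmost leaves (42 and 56 samples).
left_parent = weighted_mean([-152007.382, -199028.147], [42, 56])

# Combine the left intermediate node (98 samples) with the right one (42 samples).
root = weighted_mean([left_parent, 417378.245], [98, 42])

print(round(left_parent, 3))
print(round(root, 3))
```

Because the inputs are rounded to three decimals in the picture, the root result is only approximately zero rather than exactly zero.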

Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
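A minimal sketch of that final check, again assuming synthetic stand-in data rather than the original dataset: the value list of the root node (node index 0 in tree_.value) is compared with the column-wise mean of Y.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the data from the question).
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(140, 2))
Y = rng.normal(size=(140, 10))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

# Node 0 is the root; its value list has one entry per output dimension.
root_value = reg.tree_.value[0, :, 0]

# Column-wise mean of the training targets.
y_mean = Y.mean(axis=0)

root_is_mean = np.allclose(root_value, y_mean)
print(root_is_mean)
print(np.round(y_mean, 3))
```

With the original data, Y.mean(axis=0) would be expected to come out close to zero in every column, matching the root node in the picture.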

So, to wrap up:

The value list of each node contains the mean Y values for the training samples "belonging" to the respective node
Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
For the root node, the value list contains the mean Y values for the whole of your training dataset

Answered By - desertnaut

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
