Issue
I'm curious what the value field is in the nodes of the decision tree produced by Graphviz when used for regression. I understand that this is the number of samples in each class that are separated by a split when using decision tree classification, but I'm not sure what it means for regression.
My data has a 2 dimensional input and a 10 dimensional output. Here is an example of what a tree looks like for my regression problem:
The tree was produced using this code and visualized with webgraphviz:
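# X = (n x 2) Y = (n x 10) X_test = (m x 2)
import pickle
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

input_scaler = pickle.load(open("../input_scaler.sav", "rb"))
# Note: 'mse' was renamed to 'squared_error' in scikit-learn 1.0
reg = DecisionTreeRegressor(criterion='mse', max_depth=2)
reg.fit(X, Y)
pred = reg.predict(X_test)
# Write the tree in DOT format; the file handle is not reassigned
with open("classifier.txt", "w") as f:
    tree.export_graphviz(reg, out_file=f)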
Solution
What a regression tree actually returns as output is the mean value of the dependent variable (here Y) of the training samples that end up in the respective terminal nodes (leaves); these mean values are shown as lists named value in the picture, which are all of length 10 here, since your Y is 10-dimensional.
In other words, and using the leftmost terminal node (leaf) of your tree as an example:
- The leaf consists of the 42 samples for which X[0] <= 0.675 and X[1] <= 0.5
- The mean value of your 10-dimensional output for these 42 samples is given in the value list of this leaf, which is indeed of length 10, i.e. the mean of Y[0] is -152007.382, the mean of Y[1] is -206040.675, etc., and the mean of Y[9] is 3211.487.
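The relationship described in the list above can be checked directly on a fitted model. The following is a minimal sketch, assuming synthetic stand-in data (the arrays X and Y here are made up, not the data from the question) and the default squared-error criterion; it compares each leaf's value list with the mean of Y over the training samples that land in that leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 2 input features, 10 outputs
rng = np.random.default_rng(0)
X = rng.uniform(size=(140, 2))
Y = rng.normal(size=(140, 10))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

# apply() returns, for each sample, the index of the leaf it ends up in
leaf_ids = reg.apply(X)

# For a regressor, tree_.value has shape (node_count, n_outputs, 1)
node_values = reg.tree_.value[:, :, 0]

# Compare every leaf's stored value list with the per-leaf mean of Y
leaf_means_match = all(
    np.allclose(node_values[leaf], Y[leaf_ids == leaf].mean(axis=0))
    for leaf in np.unique(leaf_ids)
)
print(leaf_means_match)
```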
You can confirm that this is the case by predicting some samples (from your training or test set - it doesn't matter) and checking that your 10-dimensional result is one of the 4 value lists depicted in the terminal leaves above.
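One way to run this check programmatically is sketched below. It again assumes synthetic stand-in data (X, Y and X_test are made up for illustration) and reads the leaf values from the fitted model's tree_ attribute rather than from the picture.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption), same shapes as in the question
rng = np.random.default_rng(0)
X = rng.uniform(size=(140, 2))
Y = rng.normal(size=(140, 10))
X_test = rng.uniform(size=(20, 2))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

t = reg.tree_
is_leaf = t.children_left == -1          # leaves have no children
leaf_values = t.value[:, :, 0][is_leaf]  # one 10-element row per leaf

pred = reg.predict(X_test)

# Every predicted row should coincide with one of the leaf rows
each_pred_is_a_leaf_value = all(
    any(np.allclose(p, v) for v in leaf_values) for p in pred
)
print(each_pred_is_a_leaf_value)
```

With max_depth = 2 there are at most 4 leaves, so at most 4 distinct output vectors are possible regardless of how many samples are predicted.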
Additionally, you can confirm that, for each element in value, the average of the two child nodes, weighted by their sample counts, is equal to the respective element of the parent node. Again, using the first element of your 2 leftmost terminal nodes (leaves), we get:
(-42*152007.382 - 56*199028.147)/98
# -178876.39057142858
i.e. the value[0] element of their parent node (the leftmost node in the intermediate level). One more example, this time for the first value elements of your 2 intermediate nodes:
(-98*178876.391 + 42*417378.245)/140
# -0.00020000000617333822
which again agrees with the -0.0 first value element of your root node.
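The same weighted-average check can be repeated for every internal node and every output at once. This is a sketch on synthetic stand-in data (an assumption, not the data from the question), with no sample weights, using the children_left, children_right and n_node_samples arrays of the fitted tree_ object.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 2 input features, 10 outputs
rng = np.random.default_rng(0)
X = rng.uniform(size=(140, 2))
Y = rng.normal(size=(140, 10))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

t = reg.tree_
values = t.value[:, :, 0]   # shape (node_count, n_outputs)
n = t.n_node_samples        # training samples reaching each node

parents_match = True
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:          # leaf: nothing to compare
        continue
    # Average of the two children, weighted by their sample counts
    weighted = (n[left] * values[left] + n[right] * values[right]) / n[node]
    parents_match = parents_match and bool(np.allclose(weighted, values[node]))
print(parents_match)
```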
Judging from the value list of your root node, it seems that the mean values of all elements of your 10-dimensional Y are almost zero, which you can (and should) verify manually, as a final confirmation.
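A quick way to do that verification is sketched here, assuming synthetic stand-in data in place of your own X and Y: node 0 of the fitted tree_ object is the root, so its value list can be compared with the column means of Y.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 2 input features, 10 outputs
rng = np.random.default_rng(0)
X = rng.uniform(size=(140, 2))
Y = rng.normal(size=(140, 10))

reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)

root_value = reg.tree_.value[0, :, 0]   # node 0 is the root
y_mean = Y.mean(axis=0)                 # mean of each of the 10 outputs

root_is_overall_mean = bool(np.allclose(root_value, y_mean))
print(root_is_overall_mean)
```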
So, to wrap up:
- The value list of each node contains the mean Y values for the training samples "belonging" to the respective node
- Additionally, for the terminal nodes (leaves), these lists are the actual outputs of the tree model (i.e. the output will always be one of these lists, depending on X)
- For the root node, the value list contains the mean Y values for the whole of your training dataset
Answered By - desertnaut