Issue
I am trying to create a decision tree based on some training data. I have never created a decision tree before, but have completed a few linear regression models. I have 3 questions:
With linear regression I find it fairly easy to plot graphs, fit models, group factor levels, check p-values, etc. in an iterative fashion until I end up with a good predictive model. I have no idea how to evaluate a decision tree. Is there a way to get a summary of the model (for example, like the .summary() function in statsmodels)? Should this be an iterative process where I decide whether a factor is significant, and if so, how can I tell?
I have been very unsuccessful in visualising the decision tree. With the various approaches I have tried, the code seems to run without any errors, yet nothing appears / plots. The only thing I can run successfully is tree.export_text(model), which just lists feature_1, feature_2, and so on, so I don't know what any of the features actually are. Has anybody come across these difficulties with visualising, or found a simple solution?

The confusion matrix that I have generated is as follows:

[[   0  395]
 [   0 3319]]

i.e. the model is predicting all rows to the same outcome. Does anyone know why this might be?
Solution
Scikit-learn is a library designed to build predictive models, so there are no tests of significance, confidence intervals, etc. You can always build your own statistics, but this is a tedious process. In scikit-learn, you can eliminate features recursively using RFE, RFECV, etc. You can find a list of feature selection algorithms here. For the most part, these algorithms get rid of the least important feature in each loop according to feature_importances_ (where the importance of each feature is defined as its contribution to the reduction in impurity, e.g. entropy or Gini).
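As a rough sketch of how that could look in practice (the toy data below is only a stand-in for your own X and y, which are assumed here), RFECV can wrap a decision tree and drop the weakest feature on each pass:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for your training set (X, y).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Recursively drop the least important feature (per feature_importances_)
# and keep the subset of features that cross-validates best.
selector = RFECV(DecisionTreeClassifier(random_state=0), step=1, cv=5)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; higher numbers were dropped earlier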
The most straightforward way to visualize a tree is tree.plot_tree(). In particular, you should try passing the names of the features via the feature_names argument. Please show us what you have tried so far if you want a more specific answer.
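For illustration only (the answer itself contains no code), assuming model is your fitted tree and X is a pandas DataFrame whose columns carry the feature names, a minimal plotting sketch might look like the following. Note that in a plain script nothing is drawn until plt.show() is called, which is one common reason for "nothing appears":

import matplotlib.pyplot as plt
from sklearn import tree

# Assumes `model` is your fitted DecisionTreeClassifier and `X` is a
# pandas DataFrame, so its column names can label the splits.
fig, ax = plt.subplots(figsize=(16, 8))
tree.plot_tree(
    model,
    feature_names=list(X.columns),          # real names instead of feature_1, feature_2, ...
    class_names=["negative", "positive"],   # hypothetical labels; substitute your own
    filled=True,
    ax=ax,
)
plt.show()  # without this, a script produces no visible plot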
Try another criterion, set a higher max_depth, etc. Sometimes datasets have unidentifiable records, for example two observations with the exact same values in all features but different target labels. Is this the case in your dataset?
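One hedged way to act on that suggestion (again assuming X and y are your features and target; the hold-out split and parameter values here are illustrative, not prescribed by the answer) is to loop over a few criterion / max_depth settings and look at the confusion matrix for each, rather than a single accuracy number:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumes X and y hold your features and target.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Compare a few settings of criterion and max_depth and inspect
# the confusion matrix produced by each fitted tree.
for criterion in ("gini", "entropy"):
    for max_depth in (3, 5, 10, None):
        clf = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=0)
        clf.fit(X_train, y_train)
        print(criterion, max_depth)
        print(confusion_matrix(y_test, clf.predict(X_test)))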
Answered By - Arturo Sbr