Issue
I have a question (maybe I am misunderstanding something). My target value is binary (Yes/No). I make predictions with several scikit-learn classifiers and plot the ROC curve. Everything looks good except for the ROC curves of DecisionTreeClassifier() and ExtraTreeClassifier(). I get something like this:
For the other classifiers I get something similar to this:
I tried every scikit-learn function for displaying a ROC curve and got the same plot. Could you show me how I can "improve" my model or plot? My code for DecisionTreeClassifier():
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             confusion_matrix, roc_curve)
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree

model3 = make_pipeline(preprocessor, DecisionTreeClassifier())
model3 = model3.fit(data_train, target_train)
# get_feature_names_out() needs a fitted preprocessor, so call it after fit()
m = model3[:-1].get_feature_names_out()
y_pred3 = model3.predict(data_test)

plt.figure(figsize=(12, 12))
plot_tree(model3.named_steps['decisiontreeclassifier'], fontsize=10,
          node_ids=True, feature_names=m, max_depth=5)

cm3 = confusion_matrix(target_test, y_pred3, normalize='all')
cm3_display = ConfusionMatrixDisplay(cm3).plot()
plt.xlabel('Predicted class (test result)')
plt.ylabel('Actual class')
plt.show()

RocCurveDisplay.from_estimator(model3, data_test, target_test)
plt.show()
RocCurveDisplay.from_predictions(target_test, y_pred3)
plt.show()

model3_probs = model3.predict_proba(data_test)[:, 1]
model3_fpr, model3_tpr, _ = roc_curve(target_test, model3_probs)
roc_auc = metrics.auc(model3_fpr, model3_tpr)
display = metrics.RocCurveDisplay(fpr=model3_fpr, tpr=model3_tpr,
                                  roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()
Solution
How many examples are included in your test data? Possibly only three? What the curve looks like also depends on the number of samples you used for testing. Using more samples will produce an output closer to what you expect.
For clarification: the curve is produced by sweeping a cut-off threshold over your predicted scores and plotting the false positive rate against the true positive rate for each threshold value (see here). If you include only very few samples, your curve has only very few points to plot. Thus, it looks just like yours.
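A minimal sketch of that effect, using a small hypothetical label/score vector (not the asker's data): `roc_curve` produces one point per distinct threshold, so hard 0/1 scores yield only three points, while graded scores yield one point per distinct value.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])

# Hard 0/1 scores (as from predict, or a tree with only pure leaves):
# only two distinct values, so the curve has just 3 points and one "elbow"
hard_scores = np.array([0, 1, 1, 1, 0, 0])
fpr, tpr, _ = roc_curve(y_true, hard_scores)
print(len(fpr))  # 3 points

# Graded probabilities: one candidate point per distinct score,
# so the curve gets the familiar staircase shape
soft_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
fpr2, tpr2, _ = roc_curve(y_true, soft_scores)
print(len(fpr2))  # more than 3 points
```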
Edit from the comments: what happened here is that your tree fits the training data completely because its complexity is unlimited (e.g. no max_depth or min_samples per leaf is set). This means that all leaves (also at test time) are pure, so your predictions have only probabilities of 0 and 1, nothing in between. Since the threshold then does not matter, your ROC changes only once, from (0, 0) to the point determined by the (fpr, tpr) on the test set and on to (1, 1), which is exactly the shape of your curve. This can be circumvented by using a RandomForest (introducing randomness) or by restricting the decision tree, so that counting what is in each (impure) leaf yields probabilities between 0 and 1. A related thread can be found here.
Nevertheless, there is nothing wrong with your plot if pure leaves are acceptable for you!
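To illustrate the restriction idea on synthetic stand-in data (the dataset, `max_depth=3`, and `min_samples_leaf=20` below are assumptions for the sketch, not the asker's settings): an unrestricted tree ends in pure leaves and outputs only 0/1 probabilities, while a restricted tree has impure leaves and outputs intermediate probabilities, giving the ROC curve real thresholds to sweep.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary dataset standing in for the original one
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: grown until leaves are pure -> probabilities are 0 or 1
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
full_probs = full_tree.predict_proba(X_test)[:, 1]
print(np.unique(full_probs))  # typically just 0.0 and 1.0

# Restricted tree: impure leaves -> class fractions give graded probabilities
shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20,
                                      random_state=0).fit(X_train, y_train)
shallow_probs = shallow_tree.predict_proba(X_test)[:, 1]
print(np.unique(shallow_probs))  # values strictly between 0 and 1 appear
```

Feeding `shallow_probs` (instead of hard predictions) into `roc_curve` or `RocCurveDisplay.from_estimator` is what produces the smoother curve.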
Answered By - Baradrist