Issue
I start with the scikit-learn example "ROC Curve with Visualization API":
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split
X, y = load_wine(return_X_y=True)
y = y == 2  # binarize the target: class 2 vs. the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(X_train, y_train)
ax = plt.gca()
rfc_disp = RocCurveDisplay.from_estimator(rfc, X_test, y_test, ax=ax, alpha=0.8)
print(rfc_disp.roc_auc)
with the answer 0.9823232323232323.
Following this immediately with
from sklearn.metrics import roc_auc_score
y_pred = rfc.predict(X_test)
auc = roc_auc_score(y_test, y_pred)
print(auc)
I obtain 0.928030303030303, which is manifestly different.
Interestingly, I obtain the same result with the ROC Curve visualization API if I use the predicted labels:
rfc_disp1 = RocCurveDisplay.from_predictions(y_test, y_pred)
print(rfc_disp1.roc_auc)
However, integrating the curve stored in the first display (rfc_disp) with the trapezoid rule does reproduce the former result:
import numpy as np
I = np.sum(np.diff(rfc_disp.fpr) * (rfc_disp.tpr[1:] + rfc_disp.tpr[:-1])/2.)
print(I)
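For reference, NumPy's built-in trapezoid routine gives the same value (a quick check, assuming rfc_disp from the snippet above; in NumPy 2.0 and later the function is named np.trapezoid):
import numpy as np

# same trapezoidal integration as the manual sum above
print(np.trapz(rfc_disp.tpr, rfc_disp.fpr))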
What is the reason for this discrepancy? I assume it is related to how the two functions calculate the AUC (perhaps a different way of smoothing the curve?). This brings me to a more general question: how is the ROC curve obtained for a random forest in sklearn? What parameter/threshold is varied to obtain the different points on the curve? Are these just the scores of the individual trees in the forest?
Solution
You should use predict_proba for AUC: RocCurveDisplay.from_estimator scores the test set with the predicted probabilities and sweeps the decision threshold over them, whereas predict returns hard 0/1 labels, so roc_auc_score then effectively sees only a single threshold.
Try this:
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])
print(auc)
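For context, here is a minimal sketch (assuming the rfc, X_test and y_test from the question are still in scope) showing that the display works from the same probability scores, and how the different points of the ROC curve arise from sweeping the decision threshold over them:

from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (class 2). For a random forest this is the
# average of the per-tree predicted class probabilities, not a per-tree score.
probs = rfc.predict_proba(X_test)[:, 1]

# roc_curve sweeps the decision threshold over the distinct score values;
# each threshold yields one (fpr, tpr) point on the curve.
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(thresholds)

# This should reproduce rfc_disp.roc_auc from the question (0.9823...),
# since the display computes its AUC from the same probability scores.
print(roc_auc_score(y_test, probs))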
Answered By - Atacan