Issue
I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and p-value to access statistical significance.
Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.
Here are my specific questions:
- How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping).
- How to compare the AUC scores (on test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)
.
import numpy as np
np.random.seed(2018)
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)
# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]
# Ranodm Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]
# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]
def roc_curve_and_score(y_test, pred_proba):
fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
return fpr, tpr, roc_auc
plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()
Solution
Bootstrap for 95% confidence interval
You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x)
that determines whatever statistic you need from data x
and you can bootstrap like this:
def bootstrap(x, f, nsamples=1000):
stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
return np.percentile(stats, (2.5, 97.5))
This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).
In your case, you can write a more specific function like this
def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
auc_values = []
for b in range(nsamples):
idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
clf.fit(X_train[idx], y_train[idx])
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
auc_values.append(roc_auc)
return np.percentile(auc_values, (2.5, 97.5))
Here, clf
is the classifier for which you want to test the performance and X_train
, y_train
, X_test
, y_test
are like in your code.
This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):
- Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
- Random Forest: 0.983 [0.974 0.989]
- Multilayer Perceptron: 0.974 [0.223 0.98]
Permutation tests to test against chance performance
A permutation test would technically go over all permutations of your observation sequence and evaluate your roc curve with the permuted target values (features are not permuted). This is ok if you have a few observations, but it becomes very costly if you more observations. It is therefore common to subsample the number of permutations and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values
def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
idx1 = np.arange(X_train.shape[0])
idx2 = np.arange(X_test.shape[0])
auc_values = np.empty(nsamples)
for b in range(nsamples):
np.random.shuffle(idx1) # Shuffles in-place
np.random.shuffle(idx2)
clf.fit(X_train, y_train[idx1])
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
auc_values[b] = roc_auc
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
return roc_auc, np.mean(auc_values >= roc_auc)
This function again takes your classifier as clf
and returns the AUC value on the unshuffled data and the p-value (i.e. probability to observe an AUC value larger than or equal to what you have in the unshuffled data).
Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indicating that all of these classifiers perform better than chance.
Permutation test for differences between classifiers
This is much easier. Given two classifiers, you have prediction for every observation. You just shuffle the assignment between predictions and classifiers like this
def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
auc_differences = []
auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
observed_difference = auc1 - auc2
for _ in range(nsamples):
mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
auc1 = roc_auc_score(y_test.ravel(), p1)
auc2 = roc_auc_score(y_test.ravel(), p2)
auc_differences.append(auc1 - auc2)
return observed_difference, np.mean(auc_differences >= observed_difference)
With this test and 1000 samples, I find no significant differences between the three classifiers:
- Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
- Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
- random forest vs MLP: diff=0.0088, p(diff>)=0.203
Where diff denotes the difference in roc curves between the two classifiers and p(diff>) is the empirical probability to observe a larger difference on a shuffled data set.
Answered By - Ingo
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.