Tuesday, July 26, 2022

[FIXED] How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

July 26, 2022 machine-learning, neural-network, python, scikit-learn, statistics No comments

Issue

I would like to compare different binary classifiers in Python. For that, I want to calculate the ROC AUC scores, measure the 95% confidence interval (CI), and p-value to access statistical significance.

Below is a minimal example in scikit-learn which trains three different models on a binary classification dataset, plots the ROC curves and calculates the AUC scores.

Here are my specific questions:

How to calculate the 95% confidence interval (CI) of the ROC AUC scores on the test set? (e.g. with bootstrapping).
How to compare the AUC scores (on test set) and measure the p-value to assess statistical significance? (The null hypothesis is that the models are not different. Rejecting the null hypothesis means the difference in AUC scores is statistically significant.)

import numpy as np

np.random.seed(2018)

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import matplotlib
import matplotlib.pyplot as plt

data = load_breast_cancer()

X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=17)

# Naive Bayes Classifier
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
nb_prediction_proba = nb_clf.predict_proba(X_test)[:, 1]

# Ranodm Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=20)
rf_clf.fit(X_train, y_train)
rf_prediction_proba = rf_clf.predict_proba(X_test)[:, 1]

# Multi-layer Perceptron Classifier
mlp_clf = MLPClassifier(alpha=1, hidden_layer_sizes=150)
mlp_clf.fit(X_train, y_train)
mlp_prediction_proba = mlp_clf.predict_proba(X_test)[:, 1]


def roc_curve_and_score(y_test, pred_proba):
    fpr, tpr, _ = roc_curve(y_test.ravel(), pred_proba.ravel())
    roc_auc = roc_auc_score(y_test.ravel(), pred_proba.ravel())
    return fpr, tpr, roc_auc


plt.figure(figsize=(8, 6))
matplotlib.rcParams.update({'font.size': 14})
plt.grid()
fpr, tpr, roc_auc = roc_curve_and_score(y_test, rf_prediction_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, nb_prediction_proba)
plt.plot(fpr, tpr, color='green', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
fpr, tpr, roc_auc = roc_curve_and_score(y_test, mlp_prediction_proba)
plt.plot(fpr, tpr, color='crimson', lw=2,
         label='ROC AUC={0:.3f}'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.legend(loc="lower right")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity')
plt.ylabel('Sensitivity')
plt.show()

Solution

Bootstrap for 95% confidence interval

You want to repeat your analysis on multiple resamplings of your data. In the general case, assume you have a function f(x) that determines whatever statistic you need from data x and you can bootstrap like this:

def bootstrap(x, f, nsamples=1000):
    stats = [f(x[np.random.randint(x.shape[0], size=x.shape[0])]) for _ in range(nsamples)]
    return np.percentile(stats, (2.5, 97.5))

This gives you so-called plug-in estimates of the 95% confidence interval (i.e. you just take the percentiles of the bootstrap distribution).

In your case, you can write a more specific function like this

def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))

Here, clf is the classifier for which you want to test the performance and X_train, y_train, X_test, y_test are like in your code.

This gives me the following confidence intervals (rounded to three digits, 1000 bootstrap samples):

Naive Bayes: 0.986 [0.980 0.988] (estimate, lower and upper limit of confidence interval)
Random Forest: 0.983 [0.974 0.989]
Multilayer Perceptron: 0.974 [0.223 0.98]

Permutation tests to test against chance performance

A permutation test would technically go over all permutations of your observation sequence and evaluate your roc curve with the permuted target values (features are not permuted). This is ok if you have a few observations, but it becomes very costly if you more observations. It is therefore common to subsample the number of permutations and simply do a number of random permutations. Here, the implementation depends a bit more on the specific thing you want to test. The following function does that for your roc_auc values

def permutation_test(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    idx1 = np.arange(X_train.shape[0])
    idx2 = np.arange(X_test.shape[0])
    auc_values = np.empty(nsamples)
    for b in range(nsamples):
        np.random.shuffle(idx1)  # Shuffles in-place
        np.random.shuffle(idx2)
        clf.fit(X_train, y_train[idx1])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test[idx2].ravel(), pred.ravel())
        auc_values[b] = roc_auc
    clf.fit(X_train, y_train)
    pred = clf.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
    return roc_auc, np.mean(auc_values >= roc_auc)

This function again takes your classifier as clf and returns the AUC value on the unshuffled data and the p-value (i.e. probability to observe an AUC value larger than or equal to what you have in the unshuffled data).

Running this with 1000 samples gives p-values of 0 for all three classifiers. Note that these are not exact because of the sampling, but they are an indicating that all of these classifiers perform better than chance.

Permutation test for differences between classifiers

This is much easier. Given two classifiers, you have prediction for every observation. You just shuffle the assignment between predictions and classifiers like this

def permutation_test_between_clfs(y_test, pred_proba_1, pred_proba_2, nsamples=1000):
    auc_differences = []
    auc1 = roc_auc_score(y_test.ravel(), pred_proba_1.ravel())
    auc2 = roc_auc_score(y_test.ravel(), pred_proba_2.ravel())
    observed_difference = auc1 - auc2
    for _ in range(nsamples):
        mask = np.random.randint(2, size=len(pred_proba_1.ravel()))
        p1 = np.where(mask, pred_proba_1.ravel(), pred_proba_2.ravel())
        p2 = np.where(mask, pred_proba_2.ravel(), pred_proba_1.ravel())
        auc1 = roc_auc_score(y_test.ravel(), p1)
        auc2 = roc_auc_score(y_test.ravel(), p2)
        auc_differences.append(auc1 - auc2)
    return observed_difference, np.mean(auc_differences >= observed_difference)

With this test and 1000 samples, I find no significant differences between the three classifiers:

Naive bayes vs random forest: diff=0.0029, p(diff>)=0.311
Naive bayes vs MLP: diff=0.0117, p(diff>)=0.186
random forest vs MLP: diff=0.0088, p(diff>)=0.203

Where diff denotes the difference in roc curves between the two classifiers and p(diff>) is the empirical probability to observe a larger difference on a shuffled data set.

Answered By - Ingo

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Tuesday, July 26, 2022

[FIXED] How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python? (p-value, confidence interval)

Issue

Solution

Bootstrap for 95% confidence interval

Permutation tests to test against chance performance

Permutation test for differences between classifiers

0 comments:

Post a Comment

Popular Posts

Labels