Issue
I am trying to use this paired t-test code for more than 2 ML classifiers and databases:
Whole code and the databases: https://github.com/cemdogdu/stack
import numpy as np
from scipy.stats import t as t_dist

def paired_t_test(p):
    p_hat = np.mean(p)
    n = len(p)
    den = np.sqrt(sum([(diff - p_hat)**2 for diff in p]) / (n - 1))
    t = (p_hat * (n**(1/2))) / den
    p_value = t_dist.sf(t, n-1)*2
    return t, p_value
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

n_tests = 30
p_ = []
rng = np.random.RandomState(42)
for i in range(n_tests):
    randint = rng.randint(low=0, high=32767)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=randint)
    rf.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    acc1 = accuracy_score(y_test, rf.predict(X_test))
    acc2 = accuracy_score(y_test, knn.predict(X_test))
    p_.append(acc1 - acc2)
print("Paired t-test Resampled")
t, p = paired_t_test(p_)
print(f"t statistic: {t}, p-value: {p}\n")
However, when I wrap this in a for loop over several classifiers,
p_ = np.zeros(n_tests)
p = np.zeros((len(clf_list), len(clf_list)))
for ii in range(len(clf_list)):
    for jj in range(len(clf_list)):
        for kk in tqdm(range(n_tests)):
            # clf_list = deepcopy(clf_list_temp)
            clf1 = clf_list[ii]
            clf2 = clf_list[jj]
it produces different accuracies on each run of the loop that reads the datasets with "for file in glob.glob(path)".
Also, I sometimes get p-values larger than 1, which does not happen when I compare each pair individually. What could be the problem here?
Solution
Regarding why you get different results, if you look at this part of your code:
randint = rng.randint(low=0, high=32767)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=randint)
I suspect that each pass through your loop over the other classifiers draws new values from rng, so the train/test splits differ from one pass to the next, hence the different outcomes. Without a reproducible example, we cannot replicate the discrepancy you are seeing.
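One way to rule this out is to draw the split seeds once, before any loop over classifier pairs, and reuse them for every pair. The sketch below is a minimal illustration using NumPy only: `split_seeds`, `split_indices` and `pairs` are hypothetical names, and `split_indices` is just a stand-in for `train_test_split`. In your own loop you would pass `split_seeds[kk]` as `random_state` to `train_test_split` instead.

```python
import numpy as np

n_tests = 30

# Draw all split seeds once, outside the loops over classifier pairs.
seed_rng = np.random.RandomState(42)
split_seeds = seed_rng.randint(low=0, high=32767, size=n_tests)

def split_indices(n_samples, seed, test_size=0.20):
    """Hypothetical stand-in for train_test_split: returns (train_idx, test_idx)."""
    order = np.random.RandomState(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return order[n_test:], order[:n_test]

# Every pair reuses the same seeds, so repetition kk sees the same split
# no matter which two classifiers are being compared.
pairs = [(0, 1), (0, 2), (1, 2)]
splits_per_pair = {
    pair: [split_indices(100, seed) for seed in split_seeds]
    for pair in pairs
}
```

Classifiers with their own internal randomness, such as a random forest, would also need a fixed `random_state` for the accuracies to repeat exactly.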
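As for the p-values, you need to make sure the test is two-sided. t_dist.sf(t, n-1) is the upper-tail probability, which is greater than 0.5 whenever t is negative, so doubling it gives a value above 1. Using your code, you can see that when the mean of p_ (and therefore your t statistic) is negative, you end up with a p-value > 1: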
import numpy as np
from scipy.stats import t as t_dist
np.random.seed(111)
acc1 = np.random.uniform(0,1,10)
acc2 = np.random.uniform(0,1,10)
acc1.mean()
0.3450090833343872
acc2.mean()
0.44340491701581025
paired_t_test(acc1 - acc2)
(-0.9621893188877937, 1.6389080997936225)
paired_t_test(acc2 - acc1)
(0.9621893188877937, 0.3610919002063774)
If you take the absolute value of t before calling t_dist.sf, the test is two-sided whatever the sign of t:
def paired_t_test(p):
    p_hat = np.mean(p)
    n = len(p)
    den = np.sqrt(sum([(diff - p_hat)**2 for diff in p]) / (n - 1))
    t = (p_hat * (n**(1/2))) / den
    p_value = t_dist.sf(abs(t), n-1)*2
    return t, p_value
Whichever way round the difference is taken, we now get the same p-value:
paired_t_test(acc2 - acc1)
(0.9621893188877937, 0.3610919002063774)
paired_t_test(acc1 - acc2)
(-0.9621893188877937, 0.3610919002063774)
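As a cross-check, the corrected function should agree with scipy.stats.ttest_rel, which runs a two-sided paired t-test by default. The sketch below uses arbitrary random data (not the accuracies from the question) and a vectorised variant of paired_t_test, written here for illustration only:

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_rel

def paired_t_test(p):
    """Two-sided paired t-test on an array of paired differences."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    p_hat = p.mean()
    den = p.std(ddof=1)               # sample standard deviation of the differences
    t = p_hat * np.sqrt(n) / den
    p_value = t_dist.sf(abs(t), n - 1) * 2
    return t, p_value

# Arbitrary illustrative data; the seed is an assumption made for this sketch.
rng = np.random.RandomState(0)
acc1 = rng.uniform(0, 1, 10)
acc2 = rng.uniform(0, 1, 10)

t_manual, p_manual = paired_t_test(acc1 - acc2)
res = ttest_rel(acc1, acc2)
t_scipy, p_scipy = res.statistic, res.pvalue

print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

If the two disagree for your data, the differences are probably being built from accuracies that were not measured on the same splits.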
Answered By - StupidWolf