Issue
I trained a GaussianNB model from scikit-learn. When I call classifier.predict_proba on new data, it only returns 0s and 1s. It is expected to return the model's confidence in each prediction as a probability, and I doubt it can be 100% confident on data it has never seen before. I have tested it on multiple different inputs. I use CountVectorizer and TfidfTransformer for the text encoding.
The encoding:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit on the training texts, then reuse the fitted transformers on the test texts
X_train_counts = count_vect.fit_transform(X_train_word)
X_train = tfidf_transformer.fit_transform(X_train_counts).toarray()
print(X_train)

X_test_counts = count_vect.transform(X_test_word)
X_test = tfidf_transformer.transform(X_test_counts).toarray()
print(X_test)
The model (I am getting an accuracy of 91%):
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predict Class
y_pred = classifier.predict(X_test)
# Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
And finally, when I use the predict_proba method:
y_pred = classifier.predict_proba(X_test)
print(y_pred)
I am getting an output like:
[[0. 1.]
 [1. 0.]
 [0. 1.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]
It doesn't make much sense for the model to be 100% confident on new data. Besides y_test, I have tested it on other inputs and it still returns the same. Any help would be appreciated!
Edit in response to the comments:
The output of .predict_log_proba() is even stranger:
[[ 0.00000000e+00 -6.95947375e+09]
 [-4.83948755e+09  0.00000000e+00]
 [ 0.00000000e+00 -1.26497690e+10]
 ...
 [ 0.00000000e+00 -6.97191054e+09]
 [ 0.00000000e+00 -2.25589894e+09]
 [ 0.00000000e+00 -2.93089863e+09]]
Solution
Let me reproduce your results on a public 20 newsgroups dataset. For simplicity, I will use only two groups and only 30 observations:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
cats = ['alt.atheism', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', categories=cats)
# deliberately create a very small training set
X_small, y_small = newsgroups_train['data'][:30], newsgroups_train['target'][:30]
print(y_small)
# [0 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 0 1]
Now let's train a model. I will use a pipeline to stack together all algorithms in a single processor:
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    # densify the tf-idf matrix, because GaussianNB does not accept sparse input
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
    GaussianNB()
)
model.fit(X_small, y_small)
print(model.predict_proba(newsgroups_test['data']))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]
#  ...]
print((model.predict(X_small) == y_small).mean())
# 1.0
print((model.predict(newsgroups_test['data']) == newsgroups_test['target']).mean())
# 0.847124824684432
print(model.predict_proba(newsgroups_test['data']).max(axis=1).mean())
# 0.9994305488454233
In fact, not all predicted probabilities are 0 or 1, but most of them are. The average predicted probability of the predicted class is 99.94%, so the model is on average very confident in its predictions.
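To quantify how many rows are fully saturated, and to see why the huge negative log-probabilities from the question collapse to exact zeros, here is a quick sketch (it reuses the model and newsgroups_test objects defined above):

import numpy as np

proba = model.predict_proba(newsgroups_test['data'])

# fraction of test rows where the winning class gets probability exactly 1.0
print((proba.max(axis=1) == 1.0).mean())

# why predict_proba shows exact zeros and ones: float64 underflows around
# exp(-745), so a log-probability like -6.95e9 becomes exactly 0.0 when exponentiated
print(np.exp(-6.95e9))  # 0.0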
We see that accuracy on the training set is perfect, but the accuracy on the test set is only 84.7%. So it seems that our GaussianNB is overfitting - that is, it relies too much on the training dataset. Yes, this is possible even with such a simple algorithm as NB, if the feature space is large. And with CountVectorizer, each word in the vocabulary is a separate feature, and the number of all possible words is quite large. So our model is overfitting, and that's why it is producing overconfident predictions consisting of zeros and ones.
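To see how large that feature space actually is, you can inspect the fitted vectorizer (a quick sketch; 'countvectorizer' is the step name that make_pipeline assigns automatically):

# number of distinct words learned from just 30 documents
print(len(model.named_steps['countvectorizer'].vocabulary_))

Even 30 newsgroup posts typically yield a few thousand distinct tokens, i.e. far more features than training samples.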
And, as usual, we can fight overfitting with regularization. With GaussianNB, the simplest way to regularize your model is to set the parameter var_smoothing to some relatively large positive value (by default, it is 1e-9). From my experience, I suggest values in the range from 0.01 to 1; here I set it to 0.3. This means that 30% of the largest variance among all the features (i.e. of the word whose frequency varies the most across documents) will be added to the variance of every feature.
model2 = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),
    GaussianNB(var_smoothing=0.3)
)
model2.fit(X_small, y_small)
print(model2.predict_proba(newsgroups_test['data']))
# [[1.00000000e+00 6.95414544e-11]
#  [2.55262953e-02 9.74473705e-01]
#  [9.97333826e-01 2.66617361e-03]
#  ...]
print((model2.predict(X_small) == y_small).mean())
# 1.0
print((model2.predict(newsgroups_test['data']) == newsgroups_test['target']).mean())
# 0.8821879382889201
print(model2.predict_proba(newsgroups_test['data']).max(axis=1).mean())
# 0.9657781853646639
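As a side note, you can check what this smoothing amounts to in absolute terms by inspecting the fitted GaussianNB step (a quick sketch; epsilon_ is the additive variance term that scikit-learn computes during fit):

gnb = model2.named_steps['gaussiannb']
# epsilon_ = var_smoothing * (largest variance among all features)
print(gnb.epsilon_)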
We can see that after adding regularization, the predictions of our model have become less confident: the average confidence is 96.58% instead of 99.94%. Moreover, the accuracy on the test set has improved, because the earlier overconfidence was causing some incorrect predictions.
The logic behind these incorrect predictions can be illustrated as follows. Without regularization, the model relies entirely on the frequencies of the words in the training set. When it sees, e.g., the text "the probability of dying from X-rays", the model thinks: "I have seen the word 'dying' only in texts about atheism, so this must be a text about atheism". But this is a text about space, and a more regularized model is not so certain in its conclusions: it still reserves some small but non-zero probability that a text containing the word 'dying' is about some topic other than atheism.
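If you want to check what the model actually learned about a particular word, you can look up its per-class feature means (a hypothetical sketch; whether 'dying' survives tokenization in this tiny sample depends on the data):

vocab = model2.named_steps['countvectorizer'].vocabulary_
if 'dying' in vocab:
    idx = vocab['dying']
    # theta_[c, idx] is the mean tf-idf weight of the word in class c
    print(model2.named_steps['gaussiannb'].theta_[:, idx])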
So the lesson here is: whatever learning algorithm you use, find out how to regularize it, and tune the regularization parameter thoughtfully.
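A natural way to do that tuning here is a cross-validated grid search over var_smoothing (a sketch, assuming the same pipeline as above; the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# step name 'gaussiannb' is assigned automatically by make_pipeline
param_grid = {'gaussiannb__var_smoothing': [0.01, 0.03, 0.1, 0.3, 1.0]}
search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
search.fit(X_small, y_small)
print(search.best_params_, search.best_score_)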
Answered By - David Dale