Thursday, February 10, 2022

[FIXED] Why the the total number in confusion matrix not same as the data input?

February 10, 2022 machine-learning, nlp, nltk, python, scikit-learn No comments

Issue

Why the total confusion matrix does not have the same number os samples as the dataset? The dataset contains 7514 but the total at confusion matrix not exceed 2000.

Here is the code:

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(len(dataset)):
  text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
  text = text.lower()
  text = text.split()
  ps = PorterStemmer()
  text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
  text = ' '.join(text)
  corpus.append(text)

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

from sklearn import linear_model
classifier = linear_model.LogisticRegression(C=10)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)

from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,y_pred)
score2 = precision_score(y_test,y_pred)
score3= recall_score(y_test,y_pred)
print("\n")
print("Accuracy is ",round(score1*100,2),"%")
print("Precision is ",round(score2,2))
print("Recall is ",round(score3,2))

Solution

After you split data using train_test_split, you are left with 2255 samples in the test portion which is almost equal to 7514 X 0.3, then you determined the confusion matrix using this portion (test-portion). Now everything should make sense.

Answered By - meti

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Thursday, February 10, 2022

[FIXED] Why the the total number in confusion matrix not same as the data input?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels