Issue
Why the total confusion matrix does not have the same number os samples as the dataset? The dataset contains 7514 but the total at confusion matrix not exceed 2000.
Here is the code:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(len(dataset)):
text = re.sub('[^a-zA-Z]', ' ', dataset['Text'][i])
text = text.lower()
text = text.split()
ps = PorterStemmer()
text = [ps.stem(word) for word in text if not word in set(stopwords.words('english'))]
text = ' '.join(text)
corpus.append(text)
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
from sklearn import linear_model
classifier = linear_model.LogisticRegression(C=10)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print ("Confusion Matrix:\n",cm)
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
score1 = accuracy_score(y_test,y_pred)
score2 = precision_score(y_test,y_pred)
score3= recall_score(y_test,y_pred)
print("\n")
print("Accuracy is ",round(score1*100,2),"%")
print("Precision is ",round(score2,2))
print("Recall is ",round(score3,2))
Solution
After you split data using train_test_split
, you are left with 2255
samples in the test portion which is almost equal to 7514 X 0.3, then you determined the confusion matrix using this portion (test-portion). Now everything should make sense.
Answered By - meti
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.