Issue
I am trying to do sentiment analysis for text. I have 909 phrases commonly used in emails, and I scored them out of ten for how angry they are, when isolated.
Now, I upload this .csv file to a Jupyter Notebook, where I import the following modules:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
Now, I name the two columns 'Phrase' and 'Anger':
df=pd.read_csv('Book14.csv', names=['Phrase', 'Anger'])
df_x = df['Phrase']
df_y = df['Anger']
Subsequently, I split this data such that 20% is used for testing and 80% is used for training:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)
Now, I convert the words in x_train to numerical data using TfidfVectorizer:
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='en')
x_traincv = tfidfvectorizer.fit_transform(x_train.astype('U'))
Now, I convert x_traincv to an array:
a = x_traincv.toarray()
I also convert x_testcv to a numerical array:
x_testcv=tfidfvectorizer.fit_transform(x_test)
x_testcv = x_testcv.toarray()
Now, I have
mnb = MultinomialNB()
b=np.array(y_test)
error_score = 0
b=np.array(y_test)
for i in range(len(x_test)):
    mnb.fit(x_testcv,y_test)
    testmessage=x_test.iloc[i]
    predictions = mnb.predict(x_testcv[i].reshape(1,-1))
    error_score = error_score + (predictions-int(b[i]))**2
    print(testmessage)
    print(predictions)
print(error_score/len(x_test))
However, an example of the results I get is:
Bring it back [0]
It is greatly appreciatd when [0]
Apologies in advance [0]
Can you please [0]
See you then [0]
I hope this email finds you well. [0]
Thanks in advance [0]
I am sorry to inform [0]
You’re absolutely right [0]
I am deeply regretful [0]
Shoot me through [0]
I’m looking forward to [0]
As I already stated [0]
Hello [0]
We expect all students [0]
If it’s not too late [0]
and this repeats on a large scale, even for phrases that are obviously very angry. When I removed all data containing a '0' from the .csv file, the now modal value (a 10) is the only prediction for my sentences.
Why is this happening? Is it some weird way to minimise error? Are there any inherent flaws in my code? Should I take a different approach? Many thanks.
Solution
There are two problems. First, you are fitting the MultinomialNB on the test set. In your loop you have mnb.fit(x_testcv,y_test), but you should do mnb.fit(x_traincv,y_train), and it only needs to be called once, before the loop.
Second, when pre-processing you should call fit_transform only on the training data; on the test data you should call only the transform method, so that both sets are encoded with the same vocabulary learned from the training data.
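The two fixes above can be sketched roughly as follows. This is only an illustration, not the asker's exact notebook: the small phrases and anger lists are made-up stand-ins for the 'Phrase' and 'Anger' columns of Book14.csv, and stop_words is left at its default to keep the toy example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data for the 'Phrase' and 'Anger' columns.
phrases = [
    "Thanks in advance", "See you then", "Hello", "Can you please",
    "As I already stated", "This is unacceptable", "I am deeply regretful",
    "Apologies in advance", "Stop wasting my time", "I am sorry to inform",
]
anger = [0, 0, 0, 1, 7, 9, 0, 0, 10, 2]

x_train, x_test, y_train, y_test = train_test_split(
    phrases, anger, test_size=0.2, random_state=4
)

vectorizer = TfidfVectorizer(analyzer="word")
# Learn the vocabulary and IDF weights from the training phrases only.
x_traincv = vectorizer.fit_transform(x_train)
# Reuse that vocabulary for the test phrases: transform, not fit_transform.
x_testcv = vectorizer.transform(x_test)

# Fit once, on the training data, before making any predictions.
# MultinomialNB treats each anger score as a class label.
mnb = MultinomialNB()
mnb.fit(x_traincv, y_train)

predictions = mnb.predict(x_testcv)

# Mean squared error between predicted and true scores on the test set.
mse = sum((int(p) - int(t)) ** 2 for p, t in zip(predictions, y_test)) / len(y_test)

for phrase, predicted in zip(x_test, predictions):
    print(phrase, predicted)
print(mse)
```

Because the vectorizer is fitted only once, x_traincv and x_testcv have the same number of columns, which is what lets a model trained on one be applied to the other.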
Answered By - Francesco Busolin