Issue
I have already computed TF-IDF using scikit-learn, but I can't use the built-in English stop words because my text is in Bahasa Malaysia (not English). What I need is to import a txt file that contains my own list of stop words.
stopword.txt
saya
cintakan
awak
tfidf.py
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
'Saya cinta awak',
'Saya x happy awak',
'Saya geram awak',
'Saya taubat awak']
vocabulary = "taubat".split()
vectorizer = TfidfVectorizer(analyzer='word', vocabulary=vocabulary)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
Solution
You can load your own list of stop words and pass it to TfidfVectorizer through the stop_words parameter. In your example:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['Saya benci awak',
'Saya cinta awak',
'Saya x happy awak',
'Saya geram awak',
'Saya taubat awak']
# HERE YOU DO YOUR MAGIC: you open your file and load the list of STOP WORDS
stop_words = [unicode(x.strip(), 'utf-8') for x in open('stopword.txt','r').read().split('\n')]
vectorizer = TfidfVectorizer(analyzer='word', stop_words = stop_words)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
Output with stop_words:
{u'taubat': 2.09861228866811, u'happy': 2.09861228866811, u'cinta': 2.09861228866811, u'benci': 2.09861228866811, u'geram': 2.09861228866811}
Output without stop_words param:
{u'benci': 2.09861228866811, u'taubat': 2.09861228866811, u'saya': 1.0, u'awak': 1.0, u'geram': 2.09861228866811, u'cinta': 2.09861228866811, u'happy': 2.09861228866811}
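The two distinct values in these outputs are consistent with scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Here is a minimal sketch that reproduces them with only the standard library, assuming the default smooth_idf=True. The helper name smoothed_idf and the whitespace tokenization are illustrative, not part of scikit-learn.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


corpus = ['Saya benci awak',
          'Saya cinta awak',
          'Saya x happy awak',
          'Saya geram awak',
          'Saya taubat awak']

# Count how many documents contain each lowercased token.
# Note: plain whitespace splitting keeps the one-letter token 'x',
# which TfidfVectorizer's default token pattern would drop.
doc_freq = {}
for doc in corpus:
    for token in set(doc.lower().split()):
        doc_freq[token] = doc_freq.get(token, 0) + 1

idf = {token: smoothed_idf(len(corpus), df) for token, df in doc_freq.items()}

print(idf['saya'])    # appears in all 5 documents -> 1.0
print(idf['taubat'])  # appears in 1 document -> ln(3) + 1, about 2.0986
```

Words that occur in every document, such as saya and awak, get the minimum weight of 1.0, which is why they are natural candidates for the stop word list.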
Warning: I wouldn't use the vocabulary param, because it tells TfidfVectorizer to pay attention only to the words listed in it, and it's usually harder to know every word you want to take into account than to list the ones you want to dismiss. So, if you remove the vocabulary param from your example and add the stop_words param with your list, it will work as you expect.
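The snippet above is Python 2 (unicode(), the print statement). For Python 3, loading the file could look like the sketch below. The function name load_stop_words is hypothetical, and the file is assumed to be UTF-8 with one word per line.

```python
def load_stop_words(path, encoding='utf-8'):
    """Return the non-empty lines of a one-word-per-line file, lowercased.

    Lowercasing matters because TfidfVectorizer lowercases the text by
    default before it filters out stop words.
    """
    with open(path, 'r', encoding=encoding) as handle:
        return [line.strip().lower() for line in handle if line.strip()]
```

The returned list can then be passed in as TfidfVectorizer(analyzer='word', stop_words=load_stop_words('stopword.txt')). On recent scikit-learn releases, get_feature_names() has been replaced by get_feature_names_out(), and print needs parentheses in Python 3.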
Answered By - Guiem Bosch