Issue
I have a question about how sklearn's TfidfVectorizer computes the frequency of a word in each document.
The sample code I saw is:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'The dog ate a sandwich and I ate a sandwich',
...     'The wizard transfigured a sandwich'
... ]
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[ 0.75458397 0.37729199 0.53689271 0. 0. ]
[ 0. 0. 0.44943642 0.6316672 0.6316672 ]]
My question is: how do I interpret the numbers in the matrix? I understand that a 0 means the word (e.g. "wizard") does not appear in the first document, but how do I interpret the number 0.75458397? Is it the frequency with which the word "ate" appears in the first document, or the frequency with which "ate" occurs in the entire corpus?
Solution
TF-IDF (short for "term frequency-inverse document frequency") does not give you the raw frequency of a term. Each value is the term's count in that document weighted by how rare the term is across the corpus, and by default TfidfVectorizer also scales each row to unit length, so 0.75458397 is neither of the two frequencies you mention.
TF-IDF gives high scores to terms that occur in only a few documents and low scores to terms that occur in many documents, so it is, roughly speaking, a measure of how discriminative a term is for a given document. Take a look at this resource to find an excellent description of TF-IDF and to get a better idea of what it is doing.
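To see where a number like 0.75458397 comes from, here is a minimal sketch that recomputes the matrix by hand. It assumes TfidfVectorizer's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and hard-codes the counts that remain after English stop words are removed. The variable names are illustrative only.

```python
import math

# Vocabulary in alphabetical order, which is the column order of the matrix.
vocabulary = ["ate", "dog", "sandwich", "transfigured", "wizard"]

# Raw term counts per document after stop words ("the", "a", "and", "I") are dropped.
counts = [
    [2, 1, 2, 0, 0],  # 'The dog ate a sandwich and I ate a sandwich'
    [0, 0, 1, 1, 1],  # 'The wizard transfigured a sandwich'
]

n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    # Weight each count by the term's IDF ...
    weighted = [tf * w for tf, w in zip(row, idf)]
    # ... then scale the row to unit Euclidean length (assumed default: norm='l2').
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

The first entry is large because "ate" occurs twice in the first document and in no other document. "sandwich" also occurs twice there, but it appears in both documents, so its IDF weight is lower and its score ends up smaller.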
If you only want counts, you'd need to use CountVectorizer.
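As a rough illustration of the count-based alternative, the sketch below runs CountVectorizer on the same corpus from the question. It assumes scikit-learn is installed and reads the column order from the fitted vocabulary_ mapping.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich',
]

vectorizer = CountVectorizer(stop_words='english')

# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array.
counts = vectorizer.fit_transform(corpus).toarray()

# vocabulary_ maps each term to its column index, so sorting by index
# gives the terms in column order.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(terms)
print(counts)
```

Each entry is then simply the number of times the term occurs in that document, for example 2 for "ate" in the first row.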
Answered By - tttthomasssss