Issue
Is there a way to obtain the relative frequency matrix starting from the absolute frequency matrix (obtained with the CountVectorizer method)? This is the code used:
body = [
'the quick brown fox',
'the slow brown dog',
'the quick red dog',
'the lazy yellow fox'
]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)
My goal is to use the function fit_transform() (in the last line of my code) not with the absolute frequency matrix but with the relative frequency matrix. In particular, I would like to find a way to divide each row of the matrix bag_of_words by the sum of the row itself. This is not straightforward for me as the matrix is sparse.
Any advice or suggestion is appreciated. Thank you.
Solution
This can be done using TfidfVectorizer instead of CountVectorizer. However, this requires changing the following default parameters:
- set use_idf=False to remove the "idf" part of the tfidf vectorizer, leaving only the term frequency
- set norm="l1": by default, the counts are normalized by the L2 norm, whereas what you want here (dividing by the sum of all counts in the row) is the L1 norm
In practice, it would look like this:
from sklearn.feature_extraction.text import TfidfVectorizer
body = [
'the quick brown fox',
'the slow brown dog',
'the quick red dog',
'the lazy yellow fox'
]
vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
X = vectorizer.fit_transform(body)
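print(repr(X.toarray()))  # dense view of the sparse result
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() on scikit-learn < 1.0
This prints: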
array([[0.25, 0. , 0.25, 0. , 0.25, 0. , 0. , 0.25, 0. ],
[0.25, 0.25, 0. , 0. , 0. , 0. , 0.25, 0.25, 0. ],
[0. , 0.25, 0. , 0. , 0.25, 0.25, 0. , 0.25, 0. ],
[0. , 0. , 0.25, 0.25, 0. , 0. , 0. , 0.25, 0.25]])
['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']
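If you would rather keep CountVectorizer and rescale the sparse count matrix directly, as the question literally asks, one option is sklearn.preprocessing.normalize with norm='l1', which accepts sparse input. The following is a minimal sketch, not part of the original answer. It assumes all counts are non-negative, so the L1 norm of a row equals its sum, and it leaves out stop_words='english' only to match the output above. The variable names counts, rel_freq, tf and same are illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox',
]

# Absolute counts, as in the question (a sparse matrix).
counts = CountVectorizer().fit_transform(body)

# Divide each row by its L1 norm; for non-negative counts this is the row sum.
# normalize() works on sparse input and returns a sparse matrix.
rel_freq = normalize(counts, norm='l1', axis=1)

# Check that this matches the TfidfVectorizer settings shown above.
tf = TfidfVectorizer(use_idf=False, norm='l1').fit_transform(body)
same = np.allclose(rel_freq.toarray(), tf.toarray())

# The relative frequency matrix can then be passed to TruncatedSVD as before.
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(rel_freq)
```

Either matrix (rel_freq or the TfidfVectorizer output X) can be passed to svd.fit_transform(), since TruncatedSVD works on sparse matrices without converting them to dense arrays.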
Answered By - MaximeKan