Friday, January 28, 2022

[FIXED] Term relative frequency matrix from CountVectorizer


Issue

Is there a way to obtain the relative frequency matrix from the absolute frequency matrix produced by CountVectorizer? This is the code I am using:

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(bag_of_words)

My goal is to call fit_transform() (in the last line of my code) on the relative frequency matrix rather than on the absolute frequency matrix. In particular, I would like to divide each row of bag_of_words by the sum of that row. This is not straightforward for me because the matrix is sparse.

Any advice or suggestion is appreciated. Thank you.

Solution

This can be done by using TfidfVectorizer instead of CountVectorizer, with two of its default parameters changed:

use_idf=False removes the "idf" part of the tf-idf vectorizer, leaving only the term frequency
norm="l1" normalizes each row by the sum of its counts, which is what you want here; the default, norm="l2", divides each row by its Euclidean (L2) norm instead
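To make the difference between the two norms concrete, here is a minimal pure-Python sketch on a made-up count row. The helper names l1_normalize and l2_normalize are illustrative assumptions, not scikit-learn functions; they only mimic what the norm parameter does to a single row.

```python
import math

def l1_normalize(counts):
    """Divide each count by the sum of absolute counts (relative frequencies)."""
    total = sum(abs(c) for c in counts)
    return [c / total for c in counts] if total else list(counts)

def l2_normalize(counts):
    """Divide each count by the Euclidean length of the row."""
    length = math.sqrt(sum(c * c for c in counts))
    return [c / length for c in counts] if length else list(counts)

# Hypothetical row: four distinct words, one occurrence each.
row = [1, 0, 1, 0, 1, 0, 0, 1, 0]

l1_row = l1_normalize(row)
l2_row = l2_normalize(row)

print(l1_row)  # non-zero entries are 0.25; the row sums to 1
print(l2_row)  # non-zero entries are 0.5; the squares sum to 1
```

Only the L1 version yields values that can be read as relative frequencies, because only there do the entries of a row add up to 1.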

In practice, it would look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]
vectorizer = TfidfVectorizer(use_idf=False, norm="l1")
X = vectorizer.fit_transform(body)
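# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(vectorizer.get_feature_names_out().tolist())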

X.toarray() then evaluates to the matrix below, and the print call outputs the feature names in column order:

array([[0.25, 0.  , 0.25, 0.  , 0.25, 0.  , 0.  , 0.25, 0.  ],
       [0.25, 0.25, 0.  , 0.  , 0.  , 0.  , 0.25, 0.25, 0.  ],
       [0.  , 0.25, 0.  , 0.  , 0.25, 0.25, 0.  , 0.25, 0.  ],
       [0.  , 0.  , 0.25, 0.25, 0.  , 0.  , 0.  , 0.25, 0.25]])

['brown', 'dog', 'fox', 'lazy', 'quick', 'red', 'slow', 'the', 'yellow']
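If you would rather keep the CountVectorizer pipeline from the question, another option (a different technique from the answer above, sketched here under the assumption that scikit-learn and NumPy are installed) is to L1-normalize the sparse count matrix with sklearn.preprocessing.normalize, which accepts sparse input. Unlike the TfidfVectorizer example, this sketch keeps stop_words='english' from the question, so 'the' is dropped before the rows are normalized.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

body = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

vectorizer = CountVectorizer(stop_words='english')
bag_of_words = vectorizer.fit_transform(body)  # sparse matrix of absolute counts

# Divide each row by its sum (L1 norm); the result is still a sparse matrix.
relative_freq = normalize(bag_of_words, norm='l1', axis=1)

# Sanity check: every document row should now sum to 1.
row_sums = np.asarray(relative_freq.sum(axis=1)).ravel()
print(row_sums)

# Feed the relative frequencies to TruncatedSVD, as in the question.
svd = TruncatedSVD(n_components=2)
lsa = svd.fit_transform(relative_freq)
print(lsa.shape)  # (4, 2): four documents, two components
```

Because counts are non-negative, the L1 norm of a row equals its sum, so this is the row-wise division the question asks for, done without converting the matrix to a dense array.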

Answered By - MaximeKan

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
