Issue
I have a subset of a dataframe like:
PageNumber Top_words_only
56 people sun flower festival
75 sunflower sun architecture red buses festival
I want to calculate TF-IDF on the Top_words_only column of df, with each row acting as a document. I have tried:
from sklearn.feature_extraction.text import TfidfVectorizer
Vectorizer = TfidfVectorizer(lowercase=True, max_df=0.8, min_df=5, stop_words='english')
Vectors = Vectorizer.fit_transform(df['Top_words_only'])
If I print the array it comes out as:
array([[0. , 0. , 0. , ..., 0. , 0.35588179,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ]])
But I am a little confused by what this means. Why are there so many 0 values? Does using TfidfVectorizer() automatically calculate the TF-IDF values for each tag, taking into account all documents (i.e. the corpus)?
Solution
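Calling fit_transform calculates a vector for each supplied document. Every vector has the same size: the number of unique words across all supplied documents (after any filtering by stop_words, min_df and max_df). The number of zeros in a vector is the vector size minus the number of unique words in that document. And yes, the IDF part of each value is computed from the whole corpus, so a word's weight depends on how many of the documents contain it.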
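As a quick check of the "vector size minus unique words" rule, here is a minimal sketch. The three documents are made up purely for illustration, and it assumes default TfidfVectorizer settings (no stop words, no min_df/max_df filtering), so every word ends up in the vocabulary:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration
docs = ['red bus', 'red flower festival', 'sun']

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

n_docs, vocab_size = matrix.shape
dense = matrix.toarray()

for doc, row in zip(docs, dense):
    n_unique = len(set(doc.split()))          # unique words in this document
    n_zero = int(np.count_nonzero(row == 0))  # zero entries in its vector
    print(f'{doc!r}: {n_unique} unique words, {n_zero} zeros out of {vocab_size}')
```

If you turn on stop_words, min_df or max_df, some words are dropped from the vocabulary, so the counts then refer to the words that survive the filtering rather than to every word in the text.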
Using your top_words as a simple example, you show 2 documents:
'people sun flower festival'
'sunflower sun architecture red buses festival'
These have a total of 8 unique words (Vectorizer.get_feature_names_out() will give you these):
'architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower'
Calling fit_transform with those 2 documents will give 2 vectors (1 for each doc), each with length 8 (the number of unique words across the documents).
The first document, 'people sun flower festival', has 4 words, so you get 4 non-zero values in the vector and 4 zeros. Similarly, 'sunflower sun architecture red buses festival' gives 6 non-zero values and 2 zeros.
The more documents you pass in with different words, the longer each vector gets and the larger the share of zeros in it.
from sklearn.feature_extraction.text import TfidfVectorizer
top_words = ['people sun flower festival', 'sunflower sun architecture red buses festival']
Vectorizer = TfidfVectorizer()
Vectors = Vectorizer.fit_transform(top_words)
print(f'Feature names: {Vectorizer.get_feature_names_out().tolist()}')
tfidf = Vectors.toarray()
print('')
print(f'top_words[0] = {top_words[0]}')
print(f'tfidf[0] = {tfidf[0].tolist()}')
print('')
print(f'top_words[1] = {top_words[1]}')
print(f'tfidf[1] = {tfidf[1].tolist()}')
The above code will print:
Feature names: ['architecture', 'buses', 'festival', 'flower', 'people', 'red', 'sun', 'sunflower']
top_words[0] = people sun flower festival
tfidf[0] = [0.0, 0.0, 0.40993714596036396, 0.5761523551647353, 0.5761523551647353, 0.0, 0.40993714596036396, 0.0]
top_words[1] = sunflower sun architecture red buses festival
tfidf[1] = [0.4466561618018052, 0.4466561618018052, 0.31779953783628945, 0.0, 0.0, 0.4466561618018052, 0.31779953783628945, 0.4466561618018052]
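To see that the whole corpus is taken into account, the values can be rebuilt by hand. This is only a sketch, assuming the default settings (smooth_idf=True, norm='l2', sublinear_tf=False), for which the scikit-learn documentation gives idf(t) = ln((1 + n) / (1 + df(t))) + 1. The names counts, doc_freq, manual and reference are my own:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

top_words = ['people sun flower festival',
             'sunflower sun architecture red buses festival']

# Raw term counts (tf), one row per document
counts = CountVectorizer().fit_transform(top_words).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word

# Smoothed idf, as documented for the default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by idf, then L2-normalise each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(top_words).toarray()
print(np.allclose(manual, reference))
```

Words that occur in both documents ('festival' and 'sun') get an idf of 1, while words that occur in only one get ln(3/2) + 1, roughly 1.405. That is why 'festival' and 'sun' score lower than the other words in each row.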
Answered By - pcoates