Issue
My data frame is:
import numpy as np
import pandas as pd
df = pd.DataFrame({"person":['A','B','C'],"url":["google.kr","stackoverflow.com","yahoo.us"],"weight":[5,10,15]})
df2 = df.loc[np.repeat(df.index.values,df.weight)]
The TF-IDF vectorizer is:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer(token_pattern=r'\S+',smooth_idf=False,norm=None)
x = v.fit_transform(df2['url'])
and I extract the IDF using
v.idf_
which gives the IDF values as a nicely formatted array.
I am struggling with extracting either the TF-IDF or only the TF.
# for tfidf
x.toarray()
# for tf
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(token_pattern=r'\S+')
x = v.fit_transform(df2['url'])
x.toarray()
This gives me an array containing only 0s and 1s.
Solution
For the term frequency (TF), CountVectorizer counts the occurrences of each token in every row of df2. In other words, each row of df2 contains a single token, which is why every entry is 1 or 0. To get the term frequency across the entire data frame, which I assume is what you are after, you must sum the counts in each column.
# for tf
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(token_pattern=r'\S+')
x = v.fit_transform(df2['url'])
# gets term frequencies for each token
[int(sum(col)) for col in zip(*x.toarray())]
# see which tokens are which element
v.vocabulary_
Output
[5, 10, 15]
{'google.kr': 0, 'stackoverflow.com': 1, 'yahoo.us': 2}
The zip function just transposes the array, so each tuple it yields holds one 'column'. See the Python documentation for zip if you have not seen it before.
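As a small illustration of the transposition step, here is a sketch on a hand-written count matrix. The `rows` list is made up for the example and simply stands in for what `x.toarray()` returns.

```python
# Hypothetical 4-document, 3-token count matrix (stand-in for x.toarray()).
rows = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# zip(*rows) regroups the values by position, i.e. it yields the columns.
columns = list(zip(*rows))

# Summing each column gives the total count of each token.
column_sums = [sum(col) for col in columns]

print(columns)      # [(1, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
print(column_sums)  # [2, 1, 1]
```

On the real matrix, `x.toarray().sum(axis=0)` should give the same column totals without the explicit loop.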
You already know the IDF, so TF-IDF is tf * idf.
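To make the arithmetic concrete, here is a plain-Python sketch that reproduces the numbers without scikit-learn. It assumes the IDF definition scikit-learn documents for smooth_idf=False, namely ln(n / df) + 1, and it tokenizes with `str.split()`, which matches the `\S+` pattern for this data. The names `weights`, `docs`, and `doc_freq` are made up for the example.

```python
import math
from collections import Counter

# Same data as the question: each URL repeated `weight` times, one URL per document.
weights = {"google.kr": 5, "stackoverflow.com": 10, "yahoo.us": 15}
docs = [url for url, w in weights.items() for _ in range(w)]
n_docs = len(docs)

# Term frequency over the whole corpus: total occurrences of each token.
tf = Counter(token for doc in docs for token in doc.split())

# Document frequency: number of documents that contain each token.
doc_freq = Counter(token for doc in docs for token in set(doc.split()))

# Assumed IDF formula (scikit-learn with smooth_idf=False): ln(n / df) + 1.
idf = {token: math.log(n_docs / doc_freq[token]) + 1 for token in doc_freq}

# Corpus-level TF-IDF is the element-wise product tf * idf.
tfidf = {token: tf[token] * idf[token] for token in sorted(tf)}

for token, score in tfidf.items():
    print(f"{token}: tf={tf[token]}, idf={idf[token]:.4f}, tf-idf={score:.4f}")
```

With the fitted vectorizers, the equivalent would be multiplying the summed counts from CountVectorizer by the `idf_` array of the TfidfVectorizer, since both order their columns by the same sorted vocabulary.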
Answered By - fam-woodpecker