Monday, March 7, 2022

[FIXED] Can I input a pandas dataframe into "TfidfVectorizer"? If so, how do I find out how many documents are in my dataframe?

Labels: python, scikit-learn, sklearn-pandas, tfidfvectorizer, topic-modeling

Issue

Here's the raw data:

Here's about the first half of the data after reading it into a pandas dataframe:

I'm trying to run TfidfVectorizer but I keep getting the following error:

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

I saw this post that said the error occurs when the max_df value is less than the min_df value in TfidfVectorizer. I have tried several variations where my max_df value is greater than my min_df value and still get the same error. So, I think the error might be related to how my data is stored in the pandas dataframe. Am I on the right track? If so, how do I find out how many documents I have in my dataframe? If not, how can I troubleshoot this?

Here's my code:

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
tfidf = tfidf_vectorizer.fit_transform(df)

Also, here is the example I am working off of:

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')

In the above example, the min_df is greater than the max_df. I tried doing that exactly but got the following error:

ValueError: max_df corresponds to < documents than min_df

Solution

You should pass a single column of text (a pandas Series) to the fit_transform function, not the whole DataFrame. Here is an example:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
words = ['trust inten other','feel comfort express view']
df = pd.DataFrame(words,columns = ['words'])
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, stop_words=None)
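# Right: pass the text column, so each row is one document
tfidf = tfidf_vectorizer.fit_transform(df['words'])
# Wrong: iterating over the whole DataFrame yields its column names
# tfidf = tfidf_vectorizer.fit_transform(df)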

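When you pass df to the fit_transform function, the vectorizer iterates over the DataFrame, which yields its column labels, so it takes ['words'] as input instead of ['trust inten other','feel comfort express view'] as shown in the example. With only that one document, the term "words" appears in 100% of the documents, which is above max_df=0.5, so every term is pruned and the "no terms remain" error is raised. The number of documents is the number of rows in the column, which len(df['words']) returns.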
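To make this concrete, here is a minimal sketch using the same two-row DataFrame as the example above. It shows what a vectorizer would receive in each case and how to count the documents. The helper doc_count_limit is hypothetical and is not part of scikit-learn. It only mirrors the documented meaning of min_df and max_df (a float is a proportion of documents, an int is an absolute count) to illustrate why a single "document" also triggers the second error from the question.

```python
import pandas as pd

df = pd.DataFrame({"words": ["trust inten other", "feel comfort express view"]})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_from_frame = list(df)

# Iterating over one column (a Series) yields the cell values.
docs_from_column = list(df["words"])

# Number of documents = number of rows in the text column.
n_documents = len(df["words"])


def doc_count_limit(value, n_docs):
    """Hypothetical helper: turn a min_df/max_df setting into a document count.

    Assumes the documented convention: float = proportion, int = absolute count.
    """
    return value * n_docs if isinstance(value, float) else value


# With the whole DataFrame there is one "document" (the column name),
# so max_df=0.95 allows fewer documents than min_df=2 requires.
n_seen = len(docs_from_frame)
limits_conflict = doc_count_limit(0.95, n_seen) < doc_count_limit(2, n_seen)

print(docs_from_frame)   # ['words']
print(n_documents)       # 2
print(limits_conflict)   # True
```

Under the same assumption, max_df=0.95 with min_df=2 only becomes consistent once there are at least three documents, since 0.95 * 3 is the first product that reaches 2.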

Answered By - sheep

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
