Issue
I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word in its particular text file, or is it the frequency of the word in the entire corpus (all five text files)?
What are the differences when min_df and max_df are provided as integers or as floats?
The documentation doesn't seem to provide a thorough explanation, nor does it supply an example to demonstrate the use of these two parameters. Could someone provide an explanation or example demonstrating min_df and max_df?
Solution
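max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". Document frequency here is the number of documents (text files) in the corpus that contain the term, not how many times the term occurs within one file. A float is read as a proportion of the documents, and an integer as an absolute number of documents. For example:
max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
max_df = 25 means "ignore terms that appear in more than 25 documents".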
The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
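As a rough illustration of the max_df rules above, here is a minimal sketch on five made-up one-line "files". The documents and the kept_terms helper are assumptions for this example only; the fitted vocabulary is read from the vectorizer's vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five toy "files"; the contents are invented for illustration.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry",
    "apple date",
    "apple banana apple",  # "apple" occurs twice here but counts as one document
]
# Document frequencies: apple=5, banana=3, cherry=2, date=1


def kept_terms(**kwargs):
    """Hypothetical helper: fit a CountVectorizer and return its sorted vocabulary."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(docs)
    return sorted(vectorizer.vocabulary_)


default_terms = kept_terms()         # default max_df=1.0 keeps every term
half_terms = kept_terms(max_df=0.5)  # drops terms in more than 50% of 5 docs (more than 2.5)
three_terms = kept_terms(max_df=3)   # drops terms in more than 3 documents

print(default_terms)  # ['apple', 'banana', 'cherry', 'date']
print(half_terms)     # ['cherry', 'date']
print(three_terms)    # ['banana', 'cherry', 'date']
```

Note that the float 1.0 and the integer 1 are not interchangeable: max_df=1 would be read as an absolute count of one document.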
min_df is used for removing terms that appear too infrequently. For example:
min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
min_df = 5 means "ignore terms that appear in fewer than 5 documents".
The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
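To make "document frequency" concrete, the sketch below first counts, for each term, how many documents contain it, and then applies min_df as an integer and as a float. The toy documents and the kept_terms helper are assumptions for illustration, not part of the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five toy "files"; the contents are invented for illustration.
docs = [
    "apple banana cherry",
    "apple banana",
    "apple cherry",
    "apple date",
    "apple banana apple",
]

# Count matrix with default settings: one row per document, one column per term.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Document frequency = number of documents in which the term's count is nonzero.
doc_freq = {
    term: int((counts[:, col] > 0).sum())
    for term, col in vectorizer.vocabulary_.items()
}


def kept_terms(min_df):
    """Hypothetical helper: fit with the given min_df and return the sorted vocabulary."""
    fitted = CountVectorizer(min_df=min_df).fit(docs)
    return sorted(fitted.vocabulary_)


at_least_one = kept_terms(1)   # default: every term appears in at least 1 document
at_least_two = kept_terms(2)   # drops terms found in fewer than 2 documents
at_least_half = kept_terms(0.5)  # drops terms found in less than 50% of 5 docs (under 2.5)

print(at_least_one)   # ['apple', 'banana', 'cherry', 'date']
print(at_least_two)   # ['apple', 'banana', 'cherry']
print(at_least_half)  # ['apple', 'banana']
```

"apple" occurs six times in total, but its document frequency is 5 because it appears in five documents; min_df and max_df are compared against that per-document count.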
Answered By - Kevin Markham