Issue
I have a sklearn CountVectorizer trained on some corpus. When vectorizing a new document, the resulting vector contains only the tokens that appear in the vectorizer's vocabulary.
I'd like to add another feature to the vector which is the vocabulary coverage, or in other words, the percentage of tokens that are in the vocabulary.
Here's my code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good morning sunshine",
    "hello world",
    "hello sunshine",
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)
def get_vocab_coverage(vectorizer, sent):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    count = sum(w in vectorizer.vocabulary_ for w in tokenized_license)
    return count / len(tokenized_license)
get_vocab_coverage(vectorizer, "hello world") # => 1.0
get_vocab_coverage(vectorizer, "hello to you") # => 0.333
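For reference, the fitted vocabulary_ on this toy corpus maps each token to a column index; since CountVectorizer assigns indices alphabetically, I'd expect something like the following (the dict's print order may vary):

print(vectorizer.vocabulary_)
# {'good': 0, 'hello': 1, 'morning': 2, 'sunshine': 3, 'world': 4}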
The problem with this code is that it's not very pythonic, it relies on one of sklearn's internal variables, and it doesn't scale well. Any idea how I can improve it? Or is there an existing method that does the same?
Solution
The transform method of CountVectorizer may be useful:
def get_vocab_coverage(vectorizer, sent: str):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    if len(tokenized_license):
        return vectorizer.transform([sent]).sum() / len(tokenized_license)
    return 0
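For a quick sanity check, calling the rewritten function on the same examples as before (with the vectorizer fitted on the toy corpus above) should reproduce the original results:

get_vocab_coverage(vectorizer, "hello world")   # => 1.0
get_vocab_coverage(vectorizer, "hello to you")  # => 0.333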
Note that the sample string is wrapped in a list before being passed to transform (because it expects a batch of documents, not a single string), and that the len == 0 case is handled explicitly (so there is no need to catch a division-by-zero error). A few things to point out:
- the transform method returns a sparse vector, so sum works fast
- CountVectorizer gains nothing from passing documents to transform in batches; internally it is an ordinary for-loop, so a get_vocab_coverage function that processes one example at a time is enough (a batch-level convenience wrapper is sketched after this list)
- transform expects raw documents, so the function tokenizes each example twice (explicitly while creating tokenized_license, and implicitly inside transform); there seems to be no simple way around this other than going back to the previous version. If that is critical, consider rewriting the _count_vocab method to take a preprocessed_documents argument instead of raw_documents
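As a small convenience (not a performance win, per the point above), the same idea extends to a list of documents with a single transform call. The helper below is a minimal sketch of my own, not part of sklearn, and the name get_vocab_coverage_batch is hypothetical:

import numpy as np

def get_vocab_coverage_batch(vectorizer, docs):
    # Hypothetical helper: vocabulary coverage for each document in a list.
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    # in-vocabulary token count per document = row sums of the sparse matrix
    in_vocab = np.asarray(vectorizer.transform(docs).sum(axis=1)).ravel()
    # total token count per document, using the vectorizer's own pipeline
    totals = np.array([len(tokenizer(preprocessor(doc))) for doc in docs])
    # avoid division by zero for documents with no tokens
    return np.divide(in_vocab, totals,
                     out=np.zeros(len(docs), dtype=float),
                     where=totals > 0)

get_vocab_coverage_batch(vectorizer, ["hello world", "hello to you"])
# => array([1.        , 0.33333333])

This still tokenizes each document twice (once inside transform and once for the denominator), as noted in the last point above.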
Answered By - draw