Issue
I have a sklearn CountVectorizer trained on some corpus. When vectorizing a new document, the resulting vector contains only the tokens that appear in the vectorizer's vocabulary.
I'd like to add another feature to the vector which is the vocabulary coverage, or in other words, the percentage of tokens that are in the vocabulary.
Here's my code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good morning sunshine",
    "hello world",
    "hello sunshine",
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)
def get_vocab_coverage(vectorizer, sent):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    count = sum(w in vectorizer.vocabulary_ for w in tokenized_license)
    return count / len(tokenized_license)
get_vocab_coverage(vectorizer, "hello world") # => 1.0
get_vocab_coverage(vectorizer, "hello to you") # => 0.333
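For reference, the fitted vocabulary_ on this toy corpus maps each token to a column index; since CountVectorizer assigns indices alphabetically, I'd expect something like the following (the dict's print order may vary):

print(vectorizer.vocabulary_)
# {'good': 0, 'hello': 1, 'morning': 2, 'sunshine': 3, 'world': 4}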
The problem with this code is that it's not very pythonic, it relies on one of sklearn's internal variables, and it doesn't scale well. Any idea how I can improve it? Or is there an existing method that does the same?
Solution
The transform method of CountVectorizer may be useful:
def get_vocab_coverage(vectorizer, sent: str):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    if len(tokenized_license):
        return vectorizer.transform([sent]).sum() / len(tokenized_license)
    return 0
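For a quick sanity check, calling the rewritten function on the same examples as before (with the vectorizer fitted on the toy corpus above) should reproduce the original results:

get_vocab_coverage(vectorizer, "hello world")   # => 1.0
get_vocab_coverage(vectorizer, "hello to you")  # => 0.333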
Note that the sample string is wrapped in a list before being passed to transform (because it expects a batch of documents, not a single string), and that the len == 0 case is handled explicitly (so there is no need to catch a division-by-zero error). A few things to point out:
- the transform method returns a sparse vector, so sum works fast
- CountVectorizer gains nothing from passing documents to transform in batches; internally it is an ordinary for-loop, so a get_vocab_coverage function that processes one example at a time is enough (a batch-level convenience wrapper is sketched after this list)
- transform expects raw documents, so the function tokenizes each example twice (explicitly while creating tokenized_license, and implicitly inside transform); there seems to be no simple way around this other than going back to the previous version. If that is critical, consider rewriting the _count_vocab method to take a preprocessed_documents argument instead of raw_documents
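As a small convenience (not a performance win, per the point above), the same idea extends to a list of documents with a single transform call. The helper below is a minimal sketch of my own, not part of sklearn, and the name get_vocab_coverage_batch is hypothetical:

import numpy as np

def get_vocab_coverage_batch(vectorizer, docs):
    # Hypothetical helper: vocabulary coverage for each document in a list.
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    # in-vocabulary token count per document = row sums of the sparse matrix
    in_vocab = np.asarray(vectorizer.transform(docs).sum(axis=1)).ravel()
    # total token count per document, using the vectorizer's own pipeline
    totals = np.array([len(tokenizer(preprocessor(doc))) for doc in docs])
    # avoid division by zero for documents with no tokens
    return np.divide(in_vocab, totals,
                     out=np.zeros(len(docs), dtype=float),
                     where=totals > 0)

get_vocab_coverage_batch(vectorizer, ["hello world", "hello to you"])
# => array([1.        , 0.33333333])

This still tokenizes each document twice (once inside transform and once for the denominator), as noted in the last point above.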
Answered By - draw