Saturday, December 30, 2023

[FIXED] How to make CountVectorizer() ignore stopwords irrespective of the case

Labels: python-3.x, scikit-learn

Issue

I am using scikit-learn's CountVectorizer() like this:

vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=ngram_range,
)

I don't want to convert my text to lowercase, but I do want to remove all stop words regardless of case. The code above filters out "the" but not "The" or "THE". I would like to filter out "the", "THE" and "The". Is it possible to achieve this with CountVectorizer() without changing the case of the text?
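A minimal reproduction of the behaviour described above may make the problem easier to see. It is only a sketch: the sample sentence is made up for illustration, and ngram_range is left at its default of (1, 1) rather than the variable used in the snippet above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Built-in English stop list, but no lowercasing of the input text.
vectorizer = CountVectorizer(stop_words="english", lowercase=False)
vectorizer.fit(["foo Bar bar the The THE"])  # illustrative sample document

# Only the lowercase "the" is removed; "The" and "THE" survive as features.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['Bar', 'THE', 'The', 'bar', 'foo']
```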

Solution

I don't think there's a simple way to override the stop word removal and only the stop word removal, but if you pass a custom analyzer, you can provide your own stop word removal.

Here's the most minimal thing I could come up with which doesn't remove any functionality from the analyzer:

from sklearn.feature_extraction.text import CountVectorizer

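class CaseInsensitiveStopWordsAnalyzer:
    """Callable analyzer that drops stop words regardless of their case."""

    def set_cv(self, cv):
        # Keep a reference to the vectorizer so its own settings can be reused.
        self.cv = cv

    def remove_stop_words(self, stop_words, doc):
        # Compare lowercased forms, but return the tokens with their case intact.
        stop_words = set(w.lower() for w in stop_words)
        return [w for w in doc if w.lower() not in stop_words]

    def __call__(self, doc):
        # Rebuild the default word pipeline from the vectorizer's own pieces.
        preprocessor = self.cv.build_preprocessor()
        tokenizer = self.cv.build_tokenizer()
        stop_words = self.cv.get_stop_words()
        # Note: _word_ngrams is a private scikit-learn method.
        ngrams = self.cv._word_ngrams
        if preprocessor is not None:
            doc = preprocessor(doc)
        if tokenizer is not None:
            doc = tokenizer(doc)
        if stop_words is not None:
            doc = self.remove_stop_words(stop_words, doc)
        if ngrams is not None:
            doc = ngrams(doc)
        return doc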

analyzer = CaseInsensitiveStopWordsAnalyzer()
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=(1, 1),
    analyzer=analyzer,
)
analyzer.set_cv(vectorizer)

documents = ['foo Bar bar the The']
vectorizer.fit_transform(documents)
print(vectorizer.vocabulary_)

Output:

{'foo': 2, 'Bar': 0, 'bar': 1}

This removes both "the" and "The" without changing the case of the rest of the vocabulary.
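If only the three spellings from the question ("the", "The", "THE") matter, a different and simpler technique may be enough: expand the stop list itself with case variants and pass it as a plain list. This is a sketch of that alternative, not the approach from the answer above. The helper name with_case_variants is hypothetical, and it deliberately does not cover mixed-case spellings such as "tHe". It also avoids a callable analyzer, for which recent scikit-learn versions may warn that stop_words is unused.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS


def with_case_variants(words):
    """Return each word in lower, UPPER and Capitalized form (hypothetical helper)."""
    variants = set()
    for word in words:
        variants.update((word.lower(), word.upper(), word.capitalize()))
    return sorted(variants)  # stop_words expects a list, not a set


vectorizer = CountVectorizer(
    stop_words=with_case_variants(ENGLISH_STOP_WORDS),
    lowercase=False,
)
vectorizer.fit(["foo Bar bar the The THE"])  # illustrative sample document

vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['Bar', 'bar', 'foo']
```

The trade-off is coverage: the custom analyzer compares lowercased tokens and so catches any casing, while the expanded list only catches the variants it enumerates.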

Answered By - Nick ODell

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
