Issue
I am using scikit-learn's CountVectorizer() like this:
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=ngram_range,
)
I don't want to convert my text to lowercase, but I want to remove all stop words regardless of case. The above code filters out "the" but not "The" or "THE". I would like to filter "the", "THE", and "The". Is it possible to achieve this through CountVectorizer() without changing case?
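A minimal sketch reproducing the behaviour described above. The sample sentence is made up for illustration, and ngram_range is left at its default:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative input: three casings of a stop word plus two ordinary tokens.
documents = ["the The THE cat Cat"]

vectorizer = CountVectorizer(stop_words="english", lowercase=False)
vectorizer.fit(documents)

# Only the lowercase "the" matches the built-in English stop word list,
# so the capitalised variants stay in the vocabulary.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['Cat', 'THE', 'The', 'cat']
```

With lowercase=False the tokens reach the stop word filter unchanged, and the filter compares them as-is against the lowercase list.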
Solution
I don't think there's a simple way to override the stop word removal and only the stop word removal, but if you pass a custom analyzer, you can provide your own stop word removal.
Here's the most minimal thing I could come up with which doesn't remove any functionality from the analyzer:
from sklearn.feature_extraction.text import CountVectorizer


class CaseInsensitiveStopWordsAnalyzer:
    def set_cv(self, cv):
        # Keep a reference to the vectorizer so its own settings can be reused.
        self.cv = cv

    def remove_stop_words(self, stop_words, doc):
        # Compare lowercased tokens, but return the tokens with their case intact.
        stop_words = set(w.lower() for w in stop_words)
        return [w for w in doc if w.lower() not in stop_words]

    def __call__(self, doc):
        preprocessor = self.cv.build_preprocessor()
        tokenizer = self.cv.build_tokenizer()
        stop_words = self.cv.get_stop_words()
        ngrams = self.cv._word_ngrams  # private scikit-learn method
        if preprocessor is not None:
            doc = preprocessor(doc)
        if tokenizer is not None:
            doc = tokenizer(doc)
        if stop_words is not None:
            doc = self.remove_stop_words(stop_words, doc)
        if ngrams is not None:
            doc = ngrams(doc)
        return doc


analyzer = CaseInsensitiveStopWordsAnalyzer()
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=False,
    ngram_range=(1, 1),
    analyzer=analyzer,
)
# The analyzer needs the vectorizer, so link them after construction.
analyzer.set_cv(vectorizer)

documents = ['foo Bar bar the The']
vectorizer.fit_transform(documents)
print(vectorizer.vocabulary_)
Output:
{'foo': 2, 'Bar': 0, 'bar': 1}
This removes both "the" and "The" without changing the case of the rest of the vocabulary.
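If only unigrams are needed, the same idea can be sketched more compactly as a plain function. This is a variation on the approach above, not a drop-in replacement: it assumes the default token pattern and the built-in English stop word list, and it skips n-gram generation and the preprocessor entirely. The name case_insensitive_analyzer is made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Reuse scikit-learn's default tokenizer (tokens of two or more word characters).
_tokenize = CountVectorizer().build_tokenizer()


def case_insensitive_analyzer(doc):
    # Hypothetical helper: drop a token if its lowercase form is a stop word,
    # but keep the surviving tokens in their original case.
    return [tok for tok in _tokenize(doc) if tok.lower() not in ENGLISH_STOP_WORDS]


vectorizer = CountVectorizer(analyzer=case_insensitive_analyzer)
vectorizer.fit(["foo Bar bar the The THE"])

vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['Bar', 'bar', 'foo']
```

Because a callable analyzer replaces the whole preprocessing, tokenizing, and n-gram pipeline, options such as ngram_range have no effect here. The class-based version is the better fit when those options still need to be honoured.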
Answered By - Nick ODell