Issue
I want to prevent certain phrases from creeping into my models. For example, I want to prevent 'red roses' from entering my analysis. I understand how to add individual stop words, as described in Adding words to scikit-learn's CountVectorizer's stop list, by doing:
from sklearn.feature_extraction import text
additional_stop_words=['red','roses']
However, this also results in other n-grams like 'red tulips' or 'blue roses' not being detected.
I am building a TfidfVectorizer as part of my model, and I realize the processing I need might have to be inserted after this stage, but I am not sure how to do that.
My eventual aim is to do topic modelling on a piece of text. Here is the piece of code (borrowed almost directly from https://de.dariah.eu/tatom/topic_model_python.html#index-0 ) that I am working on:
import numpy as np
from sklearn import decomposition
from sklearn.feature_extraction import text

additional_stop_words = ['red', 'roses']
sw = text.ENGLISH_STOP_WORDS.union(additional_stop_words)
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=5
)

dtm = mod_vectorizer.fit_transform(df[col]).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())
num_topics = 5
num_top_words = 5
m_clf = decomposition.LatentDirichletAllocation(
    n_components=num_topics,
    random_state=1
)
doctopic = m_clf.fit_transform(dtm)
topic_words = []
for topic in m_clf.components_:
    word_idx = np.argsort(topic)[::-1][:num_top_words]
    topic_words.append([vocab[i] for i in word_idx])

doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)
for t in range(len(topic_words)):
    print("Topic {}: {}".format(t, ','.join(topic_words[t][:5])))
EDIT
Sample dataframe (I have tried to insert as many edge cases as possible), df:
Content
0 I like red roses as much as I like blue tulips.
1 It would be quite unusual to see red tulips, but not RED ROSES
2 It is almost impossible to find blue roses
3 I like most red flowers, but roses are my favorite.
4 Could you buy me some red roses?
5 John loves the color red. Roses are Mary's favorite flowers.
Solution
TfidfVectorizer allows for a custom preprocessor. You can use this to make any needed adjustments.
For example, to remove all occurrences of consecutive "red" + "roses" tokens from your example corpus (case-insensitive), use:
import re
import numpy as np
from sklearn.feature_extraction import text

cases = ["I like red roses as much as I like blue tulips.",
         "It would be quite unusual to see red tulips, but not RED ROSES",
         "It is almost impossible to find blue roses",
         "I like most red flowers, but roses are my favorite.",
         "Could you buy me some red roses?",
         "John loves the color red. Roses are Mary's favorite flowers."]

# remove_stop_phrases() is our custom preprocessing function.
def remove_stop_phrases(doc):
    # note: this regex considers "... red. Roses ..." as fair game for removal.
    # if that's not what you want, just use [r"red roses"] instead.
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc
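As a quick sanity check (not part of the original answer), applying the preprocessor directly to a couple of the sample sentences shows the phrase being stripped before the vectorizer ever sees it; the function is repeated here so the snippet is self-contained:

```python
import re

# same preprocessing function as above, repeated for a standalone demo
def remove_stop_phrases(doc):
    stop_phrases = [r"red(\s?\.?\s?)roses"]
    for phrase in stop_phrases:
        doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
    return doc

print(remove_stop_phrases("Could you buy me some red roses?"))  # -> "Could you buy me some ?"
print(remove_stop_phrases("red tulips and blue roses"))         # unchanged: no "red roses" match
```

Note that only the exact phrase is removed; "red" and "roses" survive on their own or next to other words.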
sw = text.ENGLISH_STOP_WORDS
mod_vectorizer = text.TfidfVectorizer(
    ngram_range=(2, 3),
    stop_words=sw,
    norm='l2',
    min_df=1,
    preprocessor=remove_stop_phrases  # plug in our custom preprocessor
)

dtm = mod_vectorizer.fit_transform(cases).toarray()
vocab = np.array(mod_vectorizer.get_feature_names_out())
Now vocab has all "red roses" references removed.
print(sorted(vocab))
['Could buy',
'It impossible',
'It impossible blue',
'It quite',
'It quite unusual',
'John loves',
'John loves color',
'Mary favorite',
'Mary favorite flowers',
'blue roses',
'blue tulips',
'color Mary',
'color Mary favorite',
'favorite flowers',
'flowers roses',
'flowers roses favorite',
'impossible blue',
'impossible blue roses',
'like blue',
'like blue tulips',
'like like',
'like like blue',
'like red',
'like red flowers',
'loves color',
'loves color Mary',
'quite unusual',
'quite unusual red',
'red flowers',
'red flowers roses',
'red tulips',
'roses favorite',
'unusual red',
'unusual red tulips']
UPDATE (per comment thread):
To pass in desired stop phrases along with custom stop words to a wrapper function, use:
desired_stop_phrases = [r"red(\s?\.?\s?)roses"]
desired_stop_words = ['Could', 'buy']

def wrapper(stop_words, stop_phrases):
    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc

    sw = text.ENGLISH_STOP_WORDS.union(stop_words)
    mod_vectorizer = text.TfidfVectorizer(
        ngram_range=(2, 3),
        stop_words=sw,
        norm='l2',
        min_df=1,
        preprocessor=remove_stop_phrases
    )
    dtm = mod_vectorizer.fit_transform(cases).toarray()
    vocab = np.array(mod_vectorizer.get_feature_names_out())
    return vocab

wrapper(desired_stop_words, desired_stop_phrases)
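The wrapper works because the inner remove_stop_phrases is a closure over the stop_phrases argument, which lets it satisfy TfidfVectorizer's one-argument preprocessor signature. A minimal sketch of that pattern on its own (make_preprocessor is a hypothetical name, and no sklearn is needed to see the effect):

```python
import re

def make_preprocessor(stop_phrases):
    # returns a one-argument callable, as expected by preprocessor=
    def remove_stop_phrases(doc):
        for phrase in stop_phrases:
            doc = re.sub(phrase, "", doc, flags=re.IGNORECASE)
        return doc
    return remove_stop_phrases

prep = make_preprocessor([r"red(\s?\.?\s?)roses"])
print(prep("It would be quite unusual to see red tulips, but not RED ROSES"))
```

The same factory could be reused with different phrase lists to build several vectorizers side by side.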
Answered By - andrew_reece