Issue
I am doing some text classification using sklearn.
As a first step I need a vectorizer: either CountVectorizer or TfidfVectorizer. The problem I want to tackle is that my documents often contain both the singular and plural forms of the same word. When vectorizing, I want to 'merge' the singular and plural forms and treat them as the same text feature.
I could pre-process the texts manually and replace every plural form with its singular form, but only for words I already know have this issue. Is there a more automated way, so that words which are extremely similar to each other are merged into the same feature?
UPDATE.
Based on the answer below, what I needed was stemming. Here is sample code that stems every word in the 'review' column of a dataframe df; I then use the stemmed text for vectorization and classification. Just in case anyone finds it useful.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
# Split each review on whitespace; str.split() with no argument already drops empty tokens.
df['review_token'] = df['review'].apply(lambda x: x.split())
# Stem every token, then join the stems back into one string for the vectorizer.
df['review_stemmed'] = df['review_token'].apply(lambda x: [stemmer.stem(y) for y in x])
df['review_stemmed_sentence'] = df['review_stemmed'].apply(lambda x: " ".join(x))
Solution
I think what you need is stemming, i.e. removing the endings of words so that words with a common root map to the same form. It is one of the basic operations in preprocessing text data.
The rules for stemming and lemmatization are explained here: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
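As a rough sketch of how stemming can be applied inside the vectorizer itself, rather than as a separate pre-processing pass, one option is to subclass CountVectorizer and wrap its analyzer. The class name StemmedCountVectorizer and the two toy documents are made up for illustration; the sketch assumes nltk and scikit-learn are installed.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")


class StemmedCountVectorizer(CountVectorizer):
    """Hypothetical helper: a CountVectorizer that stems each token."""

    def build_analyzer(self):
        # Reuse the default analyzer (lowercasing, tokenization, n-grams),
        # then stem every token it produces.
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(token) for token in analyzer(doc)]


# Toy documents (illustrative only): singular forms in one, plurals in the other.
docs = ["cat dog", "cats dogs"]

vectorizer = StemmedCountVectorizer()
X = vectorizer.fit_transform(docs)

# Singular and plural forms should end up in the same feature column.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

The same override should work for TfidfVectorizer, since it inherits from CountVectorizer. Note that the lambda returned by build_analyzer may prevent the fitted vectorizer from being pickled; a module-level function can be used instead if that matters.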
Answered By - user2314737