Issue
I am trying to understand how to cluster texts using sklearn. I have 800 texts (600 for training and 200 for testing) like the following:
Texts # columns name
1 Donald Trump, Donald Trump news, Trump bleach, Trump injected bleach, bleach coronavirus.
2 Thank you Janey.......laughing so much at this........you have saved my sanity in these mad times. Only bleach Trump is using is on his heed 🤣
3 His more uncharitable critics said Trump had suggested that Americans drink bleach. Trump responded that he was being sarcastic.
4 Outcry after Trump suggests injecting disinfectant as treatment.
5 Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?
6 The study also showed that bleach and isopropyl alcohol killed the virus in saliva or respiratory fluids in a matter of minutes.
and I would like to create clusters from them.
To transform the corpus into a vector space I have used tf-idf, and to cluster the documents I have used the k-means algorithm.
However, I cannot tell whether the results are what I should expect, because the output is not 'graphical' (I have tried to use CountVectorizer to get a frequency matrix, but I am probably using it in the wrong way).
What I would expect is that, when I run the following test dataset:
test_dataset = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.", "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19", "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus."]
(the test dataset comes from the column df["0"]['Names'])
I would like to see which cluster (made by k-means) each text belongs to.
Please see below the code that I am currently using:
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stop_words = stopwords.words('english')

def preprocessing(line):
    # keep letters only, lowercase, tokenize, drop stop words, lemmatize
    line = re.sub(r"[^a-zA-Z]", " ", line.lower())
    words = word_tokenize(line)
    words_lemmed = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stop_words]
    return words_lemmed

tfidf_vectorizer = TfidfVectorizer(tokenizer=preprocessing)
vec = CountVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df["0"]['Names'])
matrix = vec.fit_transform(df["0"]['Names'])
kmeans = KMeans(n_clusters=2).fit(tfidf)
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
where df["0"]['Names'] is the column 'Names' of the 0th dataframe.
A visual example, even with a different dataset but a very similar dataframe structure (just for better understanding), would also be welcome.
Any help you can provide will be greatly appreciated. Thanks
Solution
Taking your test_dataset and adding four more sentences to make a corpus:
train_data = ["'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
"Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
"Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
"find the most representative document for each topic",
"topic distribution across documents",
"to help with understanding the topic",
"one of the practical application of topic modeling is to determine"]
Creating a dataframe from the above dataset (columns must be a list, not a bare string):
import pandas as pd
df = pd.DataFrame(train_data, columns=['text'])
Now you can use either CountVectorizer or TfidfVectorizer to vectorize the text; I am using TfidfVectorizer (preprocessing is the tokenizer function from your question):
vect = TfidfVectorizer(tokenizer=preprocessing)
vectorized_text = vect.fit_transform(df['text'])
kmeans = KMeans(n_clusters=2).fit(vectorized_text)
# now predicting the cluster for given dataset
df['predicted cluster'] = kmeans.predict(vectorized_text)
Now, to predict the cluster for test data or new data:
new_sent = 'coronavirus has created lot of problem in the world'
kmeans.predict(vect.transform([new_sent])) # use transform only here, not fit_transform
# output (the label, 0 or 1, can differ between runs because k-means starts from random centroids)
array([1])
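The same idea can be applied to the whole test dataset at once. The sketch below is an assumption-laden illustration, not the only way to do it: it chains the vectorizer and k-means in a scikit-learn Pipeline, again uses the built-in English stop-word list instead of the NLTK preprocessing function, and trains on a handful of sentences taken from the question and answer. It also projects the tf-idf vectors to two dimensions with TruncatedSVD, so the resulting x and y columns could be passed to any plotting library.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

train_texts = [
    "Outcry after Trump suggests injecting disinfectant as treatment.",
    "Trump Suggested 'Injecting' Disinfectant to Cure Coronavirus?",
    "The study also showed that bleach and isopropyl alcohol killed the virus.",
    "find the most representative document for each topic",
    "topic distribution across documents",
    "to help with understanding the topic",
]
test_texts = [
    "'Please don't inject bleach': Trump's wild coronavirus claims prompt disbelief.",
    "Donald Trump has won the shock and ire of the scientific and medical communities after suggesting bogus treatments for Covid-19",
    "Bleach manufacturers have warned people not to inject themselves with disinfectant after Trump falsely suggested it might cure the coronavirus.",
]

# The pipeline fits the vectorizer and k-means on the training texts only;
# predict() then applies transform (not fit_transform) to new texts.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
model.fit(train_texts)

result = pd.DataFrame({"text": test_texts, "cluster": model.predict(test_texts)})

# Project tf-idf vectors to 2-D so they can be drawn as a scatter plot.
tfidf = model.named_steps["tfidfvectorizer"]
svd = TruncatedSVD(n_components=2, random_state=0)
train_xy = svd.fit_transform(tfidf.transform(train_texts))
test_xy = svd.transform(tfidf.transform(test_texts))
result["x"] = test_xy[:, 0]
result["y"] = test_xy[:, 1]
print(result)
```

With only two clusters and a tiny corpus the projection is crude, but on the full 600 training texts a scatter plot of these coordinates colored by cluster is a common first visual check.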
Answered By - qaiser