Issue
Groups | Role | User | Occurrences |
---|---|---|---|
GUS | DEFAULT_M | PASTYP | 47251 |
RSS | DEFAULT_R | PASTYP | 27057 |
RRD | DEFAULT_M | DANART | 21251 |
NBD | DEFAULT_R | BONEE | 17933 |
GTS | DEFAULT_Q | BONEE | 16067 |
I have about 5000 rows of data like the sample above, and I am trying to build a clustering algorithm to find out which users belong to which group. It should produce clusters of groups containing the users. When I tried to use the sklearn library for this, it told me that the data needs to be int or float, because it cannot compute a distance between these words. Is there a way I can still use the sklearn k-means algorithm on this data frame of strings to cluster user groups? The alternative would be to convert groups and users to numbers, but that would take a long time and I would need to keep a dictionary of groups and users. If I were to do so, is there an easier way to convert the groups and users to numbers so that the clustering algorithm can interpret them? Thanks in advance for your help.
Solution
As far as I know, every algorithm works on numeric data, or converts text to numbers first and then does its job. Maybe you can try this, which clusters strings by their Levenshtein (edit) distance:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # third-party package that provides levenshtein()

words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',')  # Replace this line with your own strings
words = np.asarray(words)  # So that indexing with a list will work

# Affinity propagation expects similarities, so negate the pairwise distances
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

# Print each cluster's exemplar followed by its distinct members
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
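To make the similarity matrix above less of a black box, here is a minimal pure-Python sketch of the Levenshtein (edit) distance. It is only a stand-in to illustrate the idea, not the implementation used by the third-party `distance` package. It shows that `XYZ` and `ABC` are three substitutions apart while either is nine edits from `LDPELDKSL`, which is consistent with the two short strings sharing a cluster in the result above.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b (row i-1 of the usual DP table).
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("XYZ", "ABC"))        # 3
print(levenshtein("XYZ", "LDPELDKSL"))  # 9
```

Negating these distances, as the snippet above does, turns "small distance" into "large similarity", which is the form affinity propagation works with.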
Or, if your data is free text, you can vectorize it with TF-IDF and cluster it with k-means:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Turn each document into a sparse TF-IDF vector
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8  # Equals the number of documents here, so tune it for your data
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
print("\n")

print("Prediction")
Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)
Result:
Top terms per cluster:
Cluster 0:
kitten
belly
squooshy
merley
best
eating
google
feedback
face
extension
Cluster 1:
impressed
map
feedback
google
ve
eating
face
extension
climbing
key
Cluster 2:
climbing
ninja
cat
eating
impressed
google
feedback
face
extension
ve
Cluster 3:
eating
kitty
little
came
restaurant
play
ve
feedback
face
extension
Cluster 4:
100
open
tab
smiley
face
google
feedback
extension
eating
climbing
Cluster 5:
chrome
extension
promoter
key
google
eating
impressed
feedback
face
ve
Cluster 6:
translate
app
incredible
google
eating
impressed
feedback
face
extension
ve
Cluster 7:
ve
taken
photo
best
cat
eating
google
feedback
face
extension
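For the tabular data in the question specifically, a simpler route may be to encode the string columns as numbers directly. The sketch below is an assumption about how that could look, not something taken from the snippets above: it uses `pandas.factorize` to get integer codes plus the lookup table (so no hand-written dictionary is needed) and `pandas.get_dummies` to one-hot encode the columns before passing them to `KMeans`. The column name `Occurrences`, the choice of three clusters, and leaving the count column out of the features are all illustrative choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Sample rows copied from the table in the question.
df = pd.DataFrame(
    {
        "Groups": ["GUS", "RSS", "RRD", "NBD", "GTS"],
        "Role": ["DEFAULT_M", "DEFAULT_R", "DEFAULT_M", "DEFAULT_R", "DEFAULT_Q"],
        "User": ["PASTYP", "PASTYP", "DANART", "BONEE", "BONEE"],
        "Occurrences": [47251, 27057, 21251, 17933, 16067],
    }
)

# Integer codes plus the lookup table, in order of first appearance.
# Note: plain integer codes imply an ordering that k-means would treat
# as meaningful, so they are shown here only for the "dictionary" part.
codes, uniques = pd.factorize(df["User"])
print(list(codes), list(uniques))

# One-hot encoding: each distinct string becomes its own 0/1 column,
# so no artificial ordering is imposed on the categories.
features = pd.get_dummies(df[["Groups", "Role", "User"]], dtype=float)

# n_clusters=3 is an arbitrary value for this sketch.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(features)
print(df)
```

On the full 5000 rows the same calls apply unchanged. Whether to add a scaled `Occurrences` column as an extra feature depends on whether usage counts should influence the grouping.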
Answered By - ASH