Wednesday, January 26, 2022

[FIXED] How to prepare pandas string data table for sklearn clustering algorithm?

January 26, 2022 cluster-analysis, data-science, k-means, python, scikit-learn No comments

Issue

Groups	Role	User	Occurences
GUS	DEFAULT_M	PASTYP	47251
RSS	DEFAULT_R	PASTYP	27057
RRD	DEFAULT_M	DANART	21251
NBD	DEFAULT_R	BONEE	17933
GTS	DEFAULT_Q	BONEE	16067

I have about 5000 rows of data like this one above and I am trying to make a clustering algorithm to know which users belong to certain group. It will make a clusters of groups containing the users. When I tried to use sklearn library to make the clustering algorithm, unfortunately it tells me that data needs to be int or float. It can not find distance between these words. Is there way that I can still use the sklearn k-means algorithm on these string data frame to cluster user groups? The other way would be to convert groups and users to numbers and it will take a long time and I need to keep a dictionary of groups and users. If I were to do so, is there an easier way to convert the groups and users to numbers so that clustering algorithm can interpret? Thanks for your help in advance Clustering graph

Solution

As I know, every algo works on numerics, or converts text to numerics, and then does it's job. Maybe you can try this.

import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
    
words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',') #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))

Result:

- *LDPELDKSL:* LDPELDKSL
 - *DFKLKSLFD:* DFKLKSLFD
 - *XYZ:* ABC, XYZ
 - *DLFKFKDLD:* DLFKFKDLD

Or...

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i),
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind]),
    print

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Result...Top terms per cluster:

Cluster 0:
 kitten
 belly
 squooshy
 merley
 best
 eating
 google
 feedback
 face
 extension
Cluster 1:
 impressed
 map
 feedback
 google
 ve
 eating
 face
 extension
 climbing
 key
Cluster 2:
 climbing
 ninja
 cat
 eating
 impressed
 google
 feedback
 face
 extension
 ve
Cluster 3:
 eating
 kitty
 little
 came
 restaurant
 play
 ve
 feedback
 face
 extension
Cluster 4:
 100
 open
 tab
 smiley
 face
 google
 feedback
 extension
 eating
 climbing
Cluster 5:
 chrome
 extension
 promoter
 key
 google
 eating
 impressed
 feedback
 face
 ve
Cluster 6:
 translate
 app
 incredible
 google
 eating
 impressed
 feedback
 face
 extension
 ve
Cluster 7:
 ve
 taken
 photo
 best
 cat
 eating
 google
 feedback
 face
 extension

Answered By - ASH

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, January 26, 2022

[FIXED] How to prepare pandas string data table for sklearn clustering algorithm?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels