Sunday, September 4, 2022

[FIXED] Scikit Learn - K-Means - Elbow - criterion

September 04, 2022 cluster-analysis, k-means, machine-learning, python, scikit-learn No comments

Issue

Today i'm trying to learn something about K-means. I Have understand the algorithm and i know how it works. Now i'm looking for the right k... I found the elbow criterion as a method to detect the right k but i do not understand how to use it with scikit learn?! In scikit learn i'm clustering things in this way

kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10) 
kmeans.fit(data)

So should i do this several times for n_clusters = 1...n and watch at the Error rate to get the right k ? think this would be stupid and would take a lot of time?!

Solution

The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it. But k-means is a pretty crude heuristic, too.

So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.

There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.

Either way, I have the impression that in any actual use case where k-mean is really good, you do actually know the k you need beforehand. In these cases, k-means is actually not so much a "clustering" algorithm, but a vector quantization algorithm. E.g. reducing the number of colors of an image to k. (where often you would choose k to be e.g. 32, because that is then 5 bits color depth and can be stored in a bit compressed way). Or e.g. in bag-of-visual-words approaches, where you would choose the vocabulary size manually. A popular value seems to be k=1000. You then don't really care much about the quality of the "clusters", but the main point is to be able to reduce an image to a 1000 dimensional sparse vector. The performance of a 900 dimensional or a 1100 dimensional representation will not be substantially different.

For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.

Answered By - Has QUIT--Anony-Mousse

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, September 4, 2022

[FIXED] Scikit Learn - K-Means - Elbow - criterion

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels