Issue
I want to find optimal k for using k-means on a data set. I use code below:
Sum_of_squared_distances = []
for k in range(1,15):
print(k)
Sum_of_squared_distances.append(KMeans(n_clusters=k).fit(x).inertia_)
plt.plot(range(1,15), Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.xticks(range(1,15))
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.savefig('optimal-k.jpg')
plt.show()
as you can see for some values of k, higher k has worse result. I want to know is this possible or I am doing something wrong? some in depth explanation would be appreciated.
Solution
It is possible yes.
K-means is not exactly solved for a given k
-- it's an NP-Hard problem. What you see are the locally optimum results that are achieved through a heuristic-based optimization algorithm.
To be precise, if you were to be able to exactly solve it for a given k
, you would see a graph strictly going down. But because you can't, you see a pattern as in your plot.
The algorithm's result depends on the initial random start point, so if you ran each k
many times (i.e. increase the n_init
parameter in KMeans
), you'd see a graph more smoothly decreasing, because then you'd remove some of the noise.
Answered By - Cihan
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.