Issue
I have been playing around with sklearn's k-means clustering class and I am confused about its predict method.
I have applied a model on the iris dataset like so:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # iris features and labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
pca = PCA(n_components=2).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
kmeans_pca = KMeans(n_clusters=3).fit(X_train_pca)
And have made predictions:
from sklearn.metrics import classification_report

pred = kmeans_pca.predict(X_test_pca)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.76      0.87      0.81        15
           2       0.86      0.75      0.80        16

    accuracy                           0.88        50
   macro avg       0.87      0.87      0.87        50
weighted avg       0.88      0.88      0.88        50
The predictions seem adequate, which has confused me, as I have not passed any labels to the training set. I have read this post What is the use of predict() method in kmeans implementation of scikit learn? which tells me that the predict method assigns each test point to the closest cluster centroid. However, I don't understand how sklearn ends up assigning the correct IDs during the training stage in the first place (i.e. how kmeans_pca.labels_ comes to match the respective y_train values), since training does not involve labels at all.
I realise that k-means is not used for classification tasks, but I would like to know how these results were achieved. With this, what purpose could .predict() serve when performing k-means clustering in sklearn?
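For reference, the nearest-centroid behaviour described in that post is easy to check by hand. The snippet below reuses kmeans_pca and X_test_pca from above and is only an illustration of the idea, not sklearn's internal implementation:

import numpy as np

# Distance from every test point to every learned centroid, shape (n_samples, n_clusters)
dists = np.linalg.norm(
    X_test_pca[:, None, :] - kmeans_pca.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

# Expected to print True: predict() picks the index of the closest centroid
print(np.array_equal(nearest, kmeans_pca.predict(X_test_pca)))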
Solution
KMeans assigns each data point to one of the K clusters you specified when fitting the model. The cluster IDs themselves are arbitrary: different runs can hand out different IDs, although within a run all points belonging to the same cluster share the same ID.
For example, with K=3, suppose the cluster IDs (labels) assigned to your data in one run were [1 1 0 0 2 2 2]; in the next run they could be [0 0 2 2 1 1 1]. The IDs have changed, yet points belonging to the same cluster still received the same ID.
In your case, the cluster IDs the model learned happened to line up with the true class labels, but with 3 clusters there are 3! = 6 possible ways the IDs could have been assigned, so this alignment is essentially luck.
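To make the arbitrariness of the IDs concrete, here is a small sketch assuming the X_train_pca from the question (the seeds and n_init values are just illustrative): two fits on the same data will usually find the same grouping, but possibly under permuted IDs.

from sklearn.cluster import KMeans

km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_pca)
km_b = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_train_pca)

# Both runs typically find the same partition of the points, but the integer
# IDs attached to each cluster may come out permuted between the two runs.
print(km_a.labels_[:10])
print(km_b.labels_[:10])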
For comparison, this was my output when running the same prediction with a KMeans model trained on the iris data:
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        19
           1       0.00      0.00      0.00        15
           2       0.92      0.75      0.83        16

    accuracy                           0.24        50
   macro avg       0.31      0.25      0.28        50
weighted avg       0.30      0.24      0.26        50
As you can see, in my run only the points in cluster ID 2 happened to receive the ID matching their true class; the other two clusters were assigned non-matching IDs, which drags the overall accuracy down even though the underlying clustering may be just as good as yours.
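If you do want to score a clustering against known labels, one common approach is to factor out the arbitrary IDs, either by remapping each cluster to its best-matching true class (an assignment on the confusion matrix) or by using a permutation-invariant metric such as adjusted_rand_score. A rough sketch, reusing y_test and pred from above:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix, accuracy_score, adjusted_rand_score

cm = confusion_matrix(y_test, pred)            # rows: true classes, cols: cluster IDs
row_ind, col_ind = linear_sum_assignment(-cm)  # match clusters to classes, maximising overlap
mapping = {cluster: true for true, cluster in zip(row_ind, col_ind)}
pred_aligned = np.array([mapping[c] for c in pred])

print(accuracy_score(y_test, pred_aligned))    # accuracy after remapping the cluster IDs
print(adjusted_rand_score(y_test, pred))       # already invariant to how IDs are assigned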
Answered By - Aditya