Issue
I applied a K-means algorithm to cluster some text documents using scikit-learn and displayed the clustering result. I would like to display the similarity of my clusters in a similarity matrix. I didn't see any tool in the scikit-learn library that does this.
# headlines type: <class 'numpy.ndarray'> tf-idf vectors
pca = PCA(n_components=2).fit(headlines)
data2D = pca.transform(to_headlines)
pl.scatter(data2D[:, 0], data2D[:, 1])
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(headlines)
Is there any way/library that will allow me to easily draw this cosine similarity matrix?
Solution
If I get you right, you'd like to produce a confusion matrix similar to the one shown here. However, this requires a truth and a prediction that can be compared to each other. Assuming that you have some gold standard for the classification of your headlines into k groups (the truth), you could compare this to the KMeans clustering (the prediction).
The only problem with this is that KMeans clustering is agnostic to your truth, meaning the cluster labels that it produces will not be matched to the labels of the gold standard groups. There is, however, a work-around for this, which is to match the kmeans labels to the truth labels based on the best possible match.
Here is an example of how this might work.
First, let's generate some example data - in this case 100 samples with 50 features each, sampled from 4 different (and slightly overlapping) normal distributions. The details are irrelevant; all this is supposed to do is mimic the kind of dataset you might be working with. The truth in this case is the mean of the normal distribution that a sample was generated from.
# Imports
import numpy as np
import matplotlib.pyplot as plt
# User input
n_samples = 100
n_features = 50
# Prep
truth = np.empty(n_samples)
data = np.empty((n_samples, n_features))
np.random.seed(42)
# Generate: draw a mean for each sample, then sample its features around it
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    truth[i] = mu
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)
# Show
plt.imshow(data, interpolation='none')
plt.show()
Next, we can apply the PCA and KMeans.
Note that I am not sure what exactly the point of the PCA is in your example, since you are not actually using the PCs for your KMeans. It is also unclear what the dataset to_headlines is, which you transform.
Here, I am transforming the input data itself and then using the PCs for the KMeans clustering. I am also using the output to illustrate the visualization that Saikat Kumar Dey suggested in a comment to your question: a scatter plot with points colored by cluster label.
# Imports
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
# PCA
pca = PCA(n_components=2).fit(data)
data2D = pca.transform(data)
# Kmeans
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(data2D)
# Show (points colored by cluster label)
plt.scatter(data2D[:, 0], data2D[:, 1],
            c=km.labels_, edgecolor='none')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Next, we have to find the best-matching pairs between the truth labels we generated in the beginning (here the mu of the sampled normal distributions) and the kmeans labels generated by the clustering.
In this example, I simply match them such that the number of true-positive predictions is maximized. Note that this is a simplistic, quick-and-dirty solution!
If your predictions are pretty good in general and if each group is represented by a similar number of samples in your dataset, it will probably work as intended - otherwise, it may produce mis-matches/mergers and you may somewhat overestimate the quality of your clustering as a result.
Suggestions for better solutions are welcome.
# Prep
k_labels = km.labels_  # Get cluster labels
k_labels_matched = np.empty_like(k_labels)
# For each cluster label...
for k in np.unique(k_labels):
    # ...find and assign the best-matching truth label
    match_nums = [np.sum((k_labels == k) * (truth == t)) for t in np.unique(truth)]
    k_labels_matched[k_labels == k] = np.unique(truth)[np.argmax(match_nums)]
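As one possible improvement on the greedy matching above, the sketch below forces a one-to-one assignment, so that two clusters can never be merged into the same truth label. It is not the method used in the rest of this answer: the function name match_labels_one_to_one and the toy labels are made up for illustration, and the brute-force search over permutations is only practical for a small number of clusters.

```python
from collections import Counter
from itertools import permutations


def match_labels_one_to_one(truth, pred):
    """Relabel `pred` so that each cluster maps to a distinct truth label.

    Brute-force search over all permutations; only practical for small k.
    Assumes there are at most as many clusters as truth labels.
    """
    truth = list(truth)
    pred = list(pred)
    truth_labels = sorted(set(truth))
    pred_labels = sorted(set(pred))
    # Count how often each (truth label, cluster label) pair occurs
    pair_counts = Counter(zip(truth, pred))

    best_mapping, best_hits = None, -1
    # Try every way of assigning distinct truth labels to the clusters
    for perm in permutations(truth_labels, len(pred_labels)):
        hits = sum(pair_counts[(t, p)] for t, p in zip(perm, pred_labels))
        if hits > best_hits:
            best_mapping, best_hits = dict(zip(pred_labels, perm)), hits
    return [best_mapping[p] for p in pred]


# Toy example: cluster ids are a shuffled version of the truth labels,
# with one sample assigned to the wrong cluster
truth_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_demo = [2, 2, 2, 0, 0, 1, 1, 1, 1]
matched_demo = match_labels_one_to_one(truth_demo, pred_demo)
print(matched_demo)
```

The returned list can be passed to confusion_matrix in place of k_labels_matched. For a larger number of clusters, scipy.optimize.linear_sum_assignment can solve the same assignment problem without enumerating every permutation.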
Now that we have matched truths and predictions, we can finally compute and plot the confusion matrix.
# Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(truth, k_labels_matched)
# Plot confusion matrix
plt.imshow(cm, interpolation='none', cmap='Blues')
for (i, j), z in np.ndenumerate(cm):
    plt.text(j, i, z, ha='center', va='center')
plt.xlabel("kmeans label")
plt.ylabel("truth label")
plt.show()
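If you really are after a cosine similarity matrix between clusters rather than a confusion matrix, a minimal sketch could look like this. It assumes you want to compare the cluster centers (km.cluster_centers_ after fitting KMeans on the tf-idf vectors); the helper cosine_similarity_matrix and the demo centers are hypothetical and only serve to keep the example self-contained.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Leave all-zero rows unscaled to avoid dividing by zero
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


# Hypothetical stand-in for km.cluster_centers_ (one row per cluster)
centers_demo = np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 3.0]])
sim = cosine_similarity_matrix(centers_demo)
print(np.round(sim, 3))
```

The resulting matrix can be shown with plt.imshow in the same way as the confusion matrix above.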
Hope this helps!
Answered By - WhoIsJack