Issue
I observe that subsequent runs of the same program produce different labels for the k-means clusters, even though the input features are identical. The program applies a set of transformations to an original dataframe and then to a new dataframe, the pipeline consisting of, in this order: StandardScaler -> PCA -> KMeans. The PCA and k-means models fitted on the initial data are then applied to the new dataset. Finally, the program applies the inverse transformations so that the centroids can be shown in the original feature space. So I am puzzled by the differing labels; the relevant call here is the k-means model's .predict().
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
def get_kmeans_score(data, center):
    '''
    Returns the k-means score (SSE of points to their assigned centers).
    INPUT:
        data - the dataset you want to fit k-means to
        center - the number of centers you want (the k value)
    OUTPUT:
        score - the SSE score for the k-means model fit to the data
    '''
    # Instantiate k-means
    kmeans = KMeans(n_clusters=center)
    # Then fit the model to the data using the fit method
    model = kmeans.fit(data)
    # Obtain a score related to the model fit (score() returns the negative SSE)
    score = np.abs(model.score(data))
    return score
data = {
    'apples': [3, 2, 0, 9, 2, 1],
    'oranges': [0, 7.6, 7, 2, 7, 6],
    'figs': [1.4, 11, 10.999, 3.99, 10, 2],
    'pears': [5, 2, 6, 2.45, 1, 7],
    'berries': [1.3, 4, 10, 0, 5, 21],
    'tomatoes': [5, 15, 3, 4, 17, 5],
    'onions': [11, 3, 3, 1, 0, 10]
}
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Bob', 'Karen'])
print('ORIGINAL DATA')
print(purchases)
Y1 = pd.DataFrame(np.round(purchases, 0), columns=purchases.keys())
scaler = StandardScaler()
Y = scaler.fit_transform(Y1)
pca = PCA(n_components=3)
W = pca.fit_transform(Y)
# apply k-means
scores = []
centers = list(range(1,5))
for center in centers:
    scores.append(get_kmeans_score(W, center))
X = zip(centers, scores)
print('k-means results on original data as a function of # centers')
for i in X:
    print(i)
# from the above results, assume the elbow is 4 clusters
print('_________________________________________')
n_c = 4
kmeans = KMeans(n_clusters=n_c)
model = kmeans.fit(W)
score = np.abs(model.score(W))
print('k-means score on ', n_c, ' clusters for the original dataset = ',score)
# model is the k-means model that will also be applied to the new dataset
#
NEW_data = {
    'apples': [9, 20, 10, 2, 12, 1],
    'oranges': [10, 3, 12, 1, 18, 5],
    'figs': [34, 11, 3.999, 1, 0, 12],
    'pears': [5, 2, 16, 2.45, 10, 11],
    'berries': [13, 4, 1, 2, 15, 4],
    'tomatoes': [7, 2, 1, 14, 27, 2],
    'onions': [1, 10, 11, 2, 4, 10]
}
purchases_N = pd.DataFrame(NEW_data, index=['June', 'Robert', 'Lily', 'David', 'Bob', 'Karen'])
print('NEW DATA')
print(purchases_N)
YY1 = pd.DataFrame(np.round(purchases_N, 0), columns=purchases_N.keys())
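# NOTE: fit_transform below re-fits the scaler on the new data; if the intent is to
# reuse the scaling learned from the original data, scaler.transform(YY1) would do that instead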
YY = scaler.fit_transform(YY1)
W1 = pca.transform(YY)
scoreNew = np.abs(model.score(W1))
print('k-means score on ', n_c, ' clusters for the new dataset = ',scoreNew)
print(scoreNew)
# k-means score the new dataset using the model determined on original ds
# predictions for the 2 datasets using the k-means model based on orig data
predict_purchases_dataset = model.predict(W)
predict_purchases_NewDataset = model.predict(W1)
print('original data upon PCA using n_components=3')
print(W)
print('k-means predictions --- original data')
print(predict_purchases_dataset)
print('_________________________________________')
print('new data upon PCA using n_components=3')
print(W1)
print('k-means predictions --- new data')
print(predict_purchases_NewDataset)
# the output matches the prediction on orig dataset:
# there are 2 customers in cluster 2, 2 customers in cluster 1, 1 in cluster 3 and 1 in 0
L = len(purchases.index)
x = [i for i in range(10)]
orig = []
NEW = []
for i in range(10):
    orig.append((predict_purchases_dataset == i).sum()/L)
    NEW.append((predict_purchases_NewDataset == i).sum()/L)
print('proportion of k-means clusters for original data')
print(orig)
print('proportion of k-means clusters for new data')
print(NEW)
#df_summary = pd.DataFrame({'cluster': x, 'proportion_orig': orig, 'proportion_NEW': NEW})
#df_summary.plot(x='cluster', y=['proportion_orig', 'proportion_NEW'], kind='bar')
print(model.cluster_centers_)
#
IPCA = pca.inverse_transform(model.cluster_centers_)
APPROX = scaler.inverse_transform(IPCA)
approx_df = pd.DataFrame(APPROX, columns=purchases.columns)
print('k-means centers coordinates in original features space')
print(approx_df)
Solution
Yes, this behavior is expected from k-means because of its random initial assignment of cluster seeds. There are different ways of assigning the initial cluster seeds, but by default your implementation uses the kmeans++ strategy (see the init parameter in the KMeans documentation).
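If you need the labels to be reproducible from run to run, you can fix the seed via KMeans' random_state parameter. A minimal sketch against the code above (the value 42 is an arbitrary choice; W and W1 are the PCA-transformed arrays from the question):

from sklearn.cluster import KMeans

# Fixing random_state makes the kmeans++ initialization deterministic,
# so repeated runs yield the same centroids and the same label numbering.
kmeans = KMeans(n_clusters=4, random_state=42)
model = kmeans.fit(W)

print(model.predict(W))   # identical labels on every run
print(model.predict(W1))  # predictions for the new data are reproducible too

Note that even without a fixed seed, two runs can find the same partition of the points while numbering the clusters differently: the cluster IDs themselves carry no meaning.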
Answered By - kampmani