Saturday, October 16, 2021

[FIXED] find columns closest to centroids - Pandas

Labels: dataframe, python, scikit-learn

Issue

import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

df = pd.DataFrame(columns=["State",       "Adult", "Senior","Children"]) 
df.loc[0]             = ["California",     111,    2,       6    ] 
df.loc[1]             = ["Texas",          70,     2,       4    ] 
df.loc[2]             = ["Florida",        64,     4,       5    ] 
df.loc[3]             = ["Georgia",        25,     2,       3    ] 
df.loc[4]             = ["Alaska",         90,     1,       2    ] 
df.loc[5]             = ["Hawaii",         105,    2,       1    ] 
df.loc[6]             = ["Washington",     27,     3,       2    ] 
df.loc[7]             = ["Pennsylvania",   90,     2,       1    ] 
df.loc[8]             = ["Virginia",       63,     2,       3    ] 
df.loc[9]             = ["Arizona",        34,     2,       4    ] 
df.loc[10]            = ["Michigan",       22,     5,       2    ] 
 
kmeans = KMeans(n_clusters=4) 
y = kmeans.fit_predict(df[['Adult', 'Senior', 'Children']]) 
df['Cluster'] = y 
centers = kmeans.cluster_centers_ 
plt.scatter(df.Adult, df.Senior, c=df.Cluster) 
plt.scatter(centers[:,0],centers[:,1],color='black',marker='*',label='centroid')
plt.show()

For the k-means clustering of the states above, I would like to identify the elements in each cluster that are closest to that cluster's centroid.

Solution

The KMeans implementation is based on Euclidean distance. To get the two points closest to each centroid, take the points that belong to a cluster, compute the 2-norm of the difference between each point and that cluster's centroid, and return the two points with the smallest norms:

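import numpy as np

def get_2_closest(cluster_id, df, columns, centers):
    # Rows assigned to this cluster, restricted to the feature columns.
    current = df[df["Cluster"] == cluster_id][columns]
    # Rank those rows by Euclidean distance to the cluster's centroid.
    closest = np.argsort(
        np.linalg.norm(current.to_numpy(dtype=np.float64) - centers[cluster_id], axis=1)
    )
    # Keep the two nearest rows (or one, if the cluster has a single member).
    return current.iloc[closest[:2]]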
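
To make the distance-and-rank step concrete, here is a minimal sketch on toy data. The points and the centroid are made up for illustration and do not come from the state data set; only the `np.linalg.norm` and `np.argsort` calls mirror the function above.

```python
import numpy as np

# Hypothetical toy data: three 2-D points that all belong to one cluster.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
# Hypothetical centroid for that cluster.
centroid = np.array([0.0, 0.0])

# Euclidean (2-norm) distance from each point to the centroid.
distances = np.linalg.norm(points - centroid, axis=1)

# Indices of the points, ordered from nearest to farthest.
order = np.argsort(distances)

print(distances)  # [0. 5. 1.]
print(order[:2])  # [0 2]
```

The first two entries of `order` are the positions of the two nearest points, which is what `closest[:2]` selects in `get_2_closest`.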

Full example in context:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

df = pd.DataFrame(columns=["State",       "Adult", "Senior","Children"])
df.loc[0]             = ["California",     111,    2,       6    ]
df.loc[1]             = ["Texas",          70,     2,       4    ]
df.loc[2]             = ["Florida",        64,     4,       5    ]
df.loc[3]             = ["Georgia",        25,     2,       3    ]
df.loc[4]             = ["Alaska",         90,     1,       2    ]
df.loc[5]             = ["Hawaii",         105,    2,       1    ]
df.loc[6]             = ["Washington",     27,     3,       2    ]
df.loc[7]             = ["Pennsylvania",   90,     2,       1    ]
df.loc[8]             = ["Virginia",       63,     2,       3    ]
df.loc[9]             = ["Arizona",        34,     2,       4    ]
df.loc[10]            = ["Michigan",       22,     5,       2    ]

kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[["Adult", "Senior", "Children"]])
df["Cluster"] = y
centers = kmeans.cluster_centers_


def get_2_closest(cluster_id, df, columns, centers):
    current = df[df["Cluster"] == cluster_id][columns]
    closest = np.argsort(
        np.linalg.norm(current.to_numpy(dtype=np.float64) - centers[cluster_id], axis=1)
    )
    return current.iloc[closest[:2]]


closest_frames = []

for i in range(len(centers)):
    output = get_2_closest(i, df, ["Adult", "Senior", "Children"], kmeans.cluster_centers_)
    closest_frames.append(output)

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row index.
_closest = pd.concat(closest_frames)

plt.scatter(df.Adult, df.Senior, label="Original")
plt.scatter(_closest.Adult, _closest.Senior, label="2 Closest to Centroid")
plt.scatter(centers[:, 0], centers[:, 1], color="black", marker="*", label="centroid")
plt.legend()
plt.show()

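Expected output: a scatter plot of the original points, the two points closest to each centroid, and the centroids marked with black stars.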
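
As a variation on the per-cluster loop in the full example, the same selection could probably be done in one vectorized pass. This is only a sketch under assumptions: the feature rows, cluster labels and centroids below are made-up stand-ins for `df`, `kmeans.labels_` and `kmeans.cluster_centers_`, and the `Distance` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy input: feature rows, their cluster labels, and centroids.
features = pd.DataFrame(
    {"Adult": [10, 12, 20, 50, 52, 60], "Senior": [1, 1, 1, 2, 2, 2]}
)
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([[14.0, 1.0], [54.0, 2.0]])

# centers[labels] lines up each row with its own cluster's centroid,
# so one vectorized norm gives every row's distance to that centroid.
distance = np.linalg.norm(
    features.to_numpy(dtype=np.float64) - centers[labels], axis=1
)

ranked = features.assign(Cluster=labels, Distance=distance)

# Sort by distance, then keep the first two rows of each cluster.
closest = ranked.sort_values("Distance").groupby("Cluster").head(2)
print(sorted(closest.index.tolist()))  # [0, 1, 3, 4]
```

Because the original row index is preserved, the result can be used in the same way as the frame built by the loop.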

In response to a question raised in the comments: you can get the State column back by merging the two data frames on their index:

print(
  _closest.merge(df, left_index=True, right_index=True)['State']
)

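Output (row order may differ between runs, since KMeans is initialized randomly when random_state is not set):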

4          Alaska
7    Pennsylvania
6      Washington
3         Georgia
2         Florida
8        Virginia
0      California
5          Hawaii
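
An alternative to the merge, assuming the selected rows still carry the original index labels, is a plain label lookup with `.loc`. The miniature frame below is a hypothetical stand-in for the full example, and `closest` stands in for `_closest`.

```python
import pandas as pd

# Hypothetical miniature version of the data frame from the example.
df = pd.DataFrame(
    {
        "State": ["California", "Texas", "Florida", "Georgia"],
        "Adult": [111, 70, 64, 25],
    }
)

# Stand-in for `_closest`: a subset of rows without the State column,
# whose index still holds the original row labels.
closest = df.loc[[3, 0], ["Adult"]]

# Look the states up by index label; row order follows `closest`.
states = df.loc[closest.index, "State"]
print(states.tolist())  # ['Georgia', 'California']
```

This avoids the suffixed duplicate columns that a merge creates for Adult, Senior and Children, at the cost of relying on the index being unchanged.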

Answered By - Alexander L. Hayes

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
