Issue
I want to group 10 stores into 6 clusters, but I have data for multiple years.
I tried KMeans from sklearn.cluster, but I am under the impression that it only works for a single period. I came across k-means with Dynamic Time Warping (https://tslearn.readthedocs.io/en/stable/user_guide/clustering.html) and tested it, but I am having a hard time understanding how I should restructure the data and which steps are required before running the code.
So my questions are:
- Using KMeans from sklearn.cluster, is there a way to apply clustering to time series data?
- Using TimeSeriesKMeans from tslearn.clustering, what is the correct data structure to pass to the algorithm?
This is the dataframe: I have stores 1 to 10 for the years 2021 and 2022. The goal is to group these 10 stores into 6 clusters based on all periods. In the real data, I have more than 150 stores over 20 years.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from tslearn.clustering import TimeSeriesKMeans
df_full = pd.DataFrame({'year':[2021,2021,2021,2021,2021,2021,2021,2021,2021,2021,
2022,2022,2022,2022,2022,2022,2022,2022,2022,2022],
'store':['store1','store2','store3','store4','store5','store6','store7','store8','store9','store10',
'store1','store2','store3','store4','store5','store6','store7','store8','store9','store10'],
'points': [18, 33, 19, 14, 14, 11, 20, 28, 30, 31,
35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
'assists': [3, 3, 4, 5, 4, 7, 8, 7, 6, 9, 12, 14,
5, 9, 4, 3, 4, 12, 15, 11],
'rebounds': [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
11, 6, 5, 5, 3, 8, 12, 7, 6, 5]})
Below I tried to use sklearn KMeans to group the 10 stores into 6 clusters for the year 2021, but I need to apply the clustering to both the 2021 and 2022 data.
# For a single year
df = df_full[df_full['year']==2021].copy()
# Make year and store as index before applying cluster
df.set_index(['year','store'], inplace=True)
scaled_df = StandardScaler().fit_transform(df)
kmeans_kwargs = {
"init": "random",
"n_init": 1,
"random_state": 1}
#create list to hold SSE values for each k
sse = []
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_df)
    sse.append(kmeans.inertia_)
#instantiate the k-means class, using optimal number of clusters
kmeans = KMeans(init="random", n_clusters=6 ,n_init=1, random_state=1)
#fit k-means algorithm to data
kmeans.fit(scaled_df)
#view cluster assignments for each observation
kmeans.labels_
df['cluster'] = kmeans.labels_
print(df)
Then I tried k-means with Dynamic Time Warping using tslearn. The result may not make sense, because each store may be assigned to a different cluster in each year. How should I restructure the data before applying this algorithm, and what pre-processing steps are needed?
df_dtw = df_full.set_index(['year','store'])
model = TimeSeriesKMeans(n_clusters=6, metric="dtw",
max_iter=10, random_state=1)
model.fit(df_dtw)
df_dtw['cluster'] = model.labels_
print(df_dtw)
Solution
You can pivot your original dataframe so that store becomes the index and points, assists and rebounds become columns broken down by year, then run the clustering with sklearn KMeans. After the pivot you still have one record (store) per row, and the columns hold the value of points, assists and rebounds for each year. Each store therefore receives a single cluster label based on all periods.
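A minimal sketch of the pivot approach described above, using the toy data from the question. The names `wide`, `scaled` and `X3d`, the `metric_year` column naming, and the choice of `n_init=10` are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same toy data as in the question.
stores = [f"store{i}" for i in range(1, 11)]
df_full = pd.DataFrame({
    "year": [2021] * 10 + [2022] * 10,
    "store": stores * 2,
    "points": [18, 33, 19, 14, 14, 11, 20, 28, 30, 31,
               35, 33, 29, 25, 25, 27, 29, 30, 19, 23],
    "assists": [3, 3, 4, 5, 4, 7, 8, 7, 6, 9,
                12, 14, 5, 9, 4, 3, 4, 12, 15, 11],
    "rebounds": [15, 14, 14, 10, 8, 14, 13, 9, 5, 4,
                 11, 6, 5, 5, 3, 8, 12, 7, 6, 5],
})
features = ["points", "assists", "rebounds"]

# One row per store; columns become (metric, year) pairs.
wide = df_full.pivot(index="store", columns="year", values=features)
# Flatten the MultiIndex columns to names such as "points_2021" (assumed naming).
wide.columns = [f"{metric}_{year}" for metric, year in wide.columns]

# Scale all store-by-year features, then cluster the stores once.
scaled = StandardScaler().fit_transform(wide)
kmeans = KMeans(n_clusters=6, n_init=10, random_state=1).fit(scaled)
wide["cluster"] = kmeans.labels_
print(wide)

# Optional: the same data as a 3-D array of shape (n_stores, n_years, n_features).
# This assumes every store has a row for every year.
X3d = (df_full.sort_values(["store", "year"])[features]
       .to_numpy()
       .reshape(df_full["store"].nunique(), df_full["year"].nunique(), len(features)))
```

With the real data (150 stores over 20 years) the wide table would have 60 feature columns, so the scaling step matters more. The `X3d` array follows the (n_ts, sz, d) layout that tslearn documents for time series datasets, so it may be a suitable input for `TimeSeriesKMeans.fit` in place of the flat dataframe used in the question; that step is not run here.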
Answered By - user20013032