Issue
I am writing a custom transformer in scikit-learn that uses stock KMeans to add cluster labels as a new column to a pandas dataframe. The custom transformer should fit to existing data, then transform unseen data by adding a new column named 'Clusters', and return a new dataframe with the additional column without modifying the original dataframe. Below is the code that I came up with:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters = 10):
        self.clusters = clusters
        self.model = KMeans(n_clusters = self.clusters)

    def fit(self, X):
        self.X=X
        self.model.fit (self.X)
        return self.model

    def transform(self, X):
        self.X=X
        X_=X.copy() # avoiding modification of the original df
        X_['Clusters'] = self.model.transform(self.X_).labels_
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
cluster_enc_tr_data
Unfortunately the code does not work properly. The result is a dataframe with cluster numbers as column indices, with row numbers and previously unknown values. Any help or tips will be greatly appreciated.
Update 23 of June 21 v2: Please see below the code after implementing Ben's revised comments. It works perfectly now.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters = 10):
        self.clusters = clusters

    def fit(self, X):
        self.X=X
        self.model = KMeans(n_clusters = self.clusters)
        self.model.fit(self.X)
        return self

    def transform(self, X):
        self.X=X
        X_=X.copy() # avoiding modification of the original df
        X_['Clusters'] = self.model.predict(X_)
        return X_

cluster_enc_tr_data = AddClustersFeature().fit_transform(enc_tr_data)
Solution
The fit method must always return self. The problem here is that fit_transform(X, y), inherited from TransformerMixin, is just fit(X, y).transform(X); your fit now returns the underlying KMeans transformer, and that is used to transform X instead of your transform.
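To make the chaining concrete, here is a minimal sketch that reproduces the symptom on made-up data. The class name ReturnsModel, the toy dataframe, and the KMeans settings (n_init, random_state) are assumptions chosen for illustration, not part of the question. Because fit hands back the KMeans object, fit_transform ends up calling KMeans.transform, so the result has one column per cluster and no 'Clusters' column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans


class ReturnsModel(BaseEstimator, TransformerMixin):
    """Hypothetical transformer whose fit returns the KMeans object (the bug)."""

    def __init__(self, clusters=3):
        self.clusters = clusters

    def fit(self, X, y=None):
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10, random_state=0)
        self.model_.fit(X)
        return self.model_  # bug: this should be `return self`

    def transform(self, X):
        # Never reached through fit_transform, because fit returned KMeans.
        X_ = X.copy()
        X_["Clusters"] = self.model_.predict(X)
        return X_


# Made-up data: 20 rows, 2 numeric features.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(20, 2)), columns=["a", "b"])

# fit_transform is fit(X).transform(X), so this calls KMeans.transform.
buggy_out = ReturnsModel().fit_transform(toy_df)
print(type(buggy_out).__name__, buggy_out.shape)
```

The printed shape has as many columns as there are clusters, which matches the "cluster numbers as column indices" described in the question.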
A few more notes though:
- KMeans.transform gives the cluster-distance matrix, but you want the cluster labels. Use predict instead, and drop labels_, so just X_['Clusters'] = self.model.predict(X_).
- __init__ should only set attributes that appear in its signature, in order for cloning to work (required for e.g. hyperparameter searches). You can define self.model at fit time.
- In transform, you use self.X_ but it is never defined; I guess you mean just X_. There is no real reason to save X at fit time either; self.X is never really needed.
- This will only work on dataframes; that may not be a problem for you, but keep it in mind. (You can't use this as a step in a pipeline after builtin sklearn transformers, because by default those will return numpy arrays.)
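A minimal sketch of how these notes might be combined, under a few assumptions that go beyond the original code: the fitted model is stored as model_ (the trailing underscore is the usual scikit-learn convention for fitted attributes), fit accepts an optional y so the class can sit in a pipeline, n_init is set explicitly, and non-dataframe input is wrapped in a DataFrame so that numpy arrays are accepted too. The toy data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans


class AddClustersFeature(BaseEstimator, TransformerMixin):
    def __init__(self, clusters=10):
        # Only store parameters that appear in the signature.
        self.clusters = clusters

    def fit(self, X, y=None):
        # Build the model at fit time so it always reflects self.clusters.
        self.model_ = KMeans(n_clusters=self.clusters, n_init=10).fit(X)
        return self  # fit must return the transformer itself

    def transform(self, X):
        # Assumption: wrap arrays in a DataFrame so both input types work.
        X_ = X.copy() if isinstance(X, pd.DataFrame) else pd.DataFrame(X)
        X_["Clusters"] = self.model_.predict(X)  # labels, not distances
        return X_


# Made-up data: 30 rows, 2 numeric features.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.normal(size=(30, 2)), columns=["a", "b"])

out = AddClustersFeature(clusters=3).fit_transform(toy_df)

# Cloning and set_params take effect because KMeans is created in fit.
retuned = clone(AddClustersFeature()).set_params(clusters=3).fit(toy_df)

# The same class on a plain numpy array.
arr_out = AddClustersFeature(clusters=3).fit_transform(toy_df.to_numpy())
```

Here out keeps the original columns and gains 'Clusters', while toy_df is left unchanged. If KMeans were created in __init__ instead, set_params(clusters=3) would update the clusters attribute but not the already constructed model.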
Answered By - Ben Reiniger