Issue
I am using DBSCAN to cluster some data using Scikit-Learn (Python 2.7):
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(random_state=0)
dbscan.fit(X)
However, I found that there was no built-in function (aside from "fit_predict") that could assign the new data points, Y, to the clusters identified in the original data, X. The K-means method has a "predict" function but I want to be able to do the same with DBSCAN. Something like this:
dbscan.predict(X, Y)
The idea is that the density is inferred from X, but the return values (cluster assignments/labels) are only for Y. From what I can tell, this capability is available in R, so I assume that it is also somehow available in Python. I just can't seem to find any documentation for it.
Also, I have tried searching for reasons as to why DBSCAN may not be used for labeling new data but I haven't found any justifications.
Solution
Clustering is not classification.
Clustering is unlabeled. If you want to squeeze it into a prediction mindset (which is not the best idea), then it essentially predicts without learning, because there is no labeled training data available for clustering. It has to make up new labels for the data, based on what it sees. But you can't do this on a single instance; you can only "bulk predict".
But there is something wrong in the documentation of scikit-learn's DBSCAN:
random_state : numpy.RandomState, optional
    The generator used to initialize the centers. Defaults to numpy.random.
DBSCAN does not "initialize the centers", because there are no centers in DBSCAN.
Pretty much the only clustering algorithm where you can assign new points to the old clusters is k-means (and its many variations), because each iteration performs a "1NN classification" using the previous iteration's cluster centers and then updates the centers. But most algorithms don't work like k-means, so you can't copy this.
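To illustrate the k-means point, here is a minimal sketch showing that assigning new points with k-means amounts to a nearest-center lookup. The toy data X and Y and the parameter choices are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
Y = np.array([[0.1, -0.2], [4.8, 5.1]])  # new points to assign

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Euclidean distance from every new point to every fitted center.
dists = np.linalg.norm(Y[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2)

# "1NN classification" against the centers: pick the closest one.
nearest_center = dists.argmin(axis=1)

# Compare the manual lookup with the built-in predict.
same = np.array_equal(nearest_center, kmeans.predict(Y))
```

DBSCAN has no centers to look up, which is why this trick does not carry over directly.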
If you want to classify new points, it is best to train a classifier on your clustering result.
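A minimal sketch of that approach, assuming a 1-nearest-neighbor classifier and toy data; the eps and min_samples values are illustrative, and dropping the noise points before training is my own choice, not something DBSCAN prescribes:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

# Toy data (an assumption for illustration): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
Y = np.array([[0.1, -0.2], [4.8, 5.1]])  # new points to label

# Step 1: cluster the original data.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = dbscan.labels_

# Step 2: train a classifier on the clustering result.
# Noise points (label -1) are left out so that "noise" is not learned as a class.
clustered = labels != -1
clf = KNeighborsClassifier(n_neighbors=1).fit(X[clustered], labels[clustered])

# Step 3: label the new points with the classifier.
Y_labels = clf.predict(Y)
```

Any classifier could stand in for the 1NN one here. Note that this variant never returns the noise label for new points, however far they are from the training data.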
What the R version may be doing is using a 1NN classifier for prediction, maybe with the extra rule that points are assigned the noise label if their 1NN distance is larger than epsilon, and maybe also using the core points only. Maybe not.
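That guessed rule (1NN against the core points, noise beyond epsilon) could be sketched as follows. dbscan_predict is a hypothetical helper, not part of scikit-learn, and I have not checked that it matches what R does; it also assumes the default Euclidean metric and that the fitted model found at least one core point.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_predict(dbscan, Y):
    """Hypothetical helper: label new points by their nearest core point."""
    core_points = dbscan.components_                          # core samples found during fit
    core_labels = dbscan.labels_[dbscan.core_sample_indices_]  # their cluster labels
    nn = NearestNeighbors(n_neighbors=1).fit(core_points)
    dist, idx = nn.kneighbors(Y)
    dist, idx = dist.ravel(), idx.ravel()
    # Farther than eps from every core point: assign the noise label -1.
    return np.where(dist <= dbscan.eps, core_labels[idx], -1)

# Toy data (an assumption for illustration): two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
Y = np.array([[0.1, -0.2], [4.8, 5.1], [2.5, 2.5]])  # last point lies between the blobs

dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)
predicted = dbscan_predict(dbscan, Y)
```

With this rule a new point gets the same treatment a border point would get in DBSCAN itself: it joins a cluster only if it is within eps of a core point, and it never changes the existing clusters.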
Get the DBSCAN paper; it does not discuss "prediction", IIRC.
Answered By - Has QUIT--Anony-Mousse