Issue
I am using scikit-learn. I want to cluster a 6 GB dataset of documents and find clusters of similar documents.
However, I only have about 4 GB of RAM. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you; please let me know if you have any questions.
Solution
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script in scikit-learn that demonstrates MiniBatchKMeans.
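A minimal sketch of that approach: HashingVectorizer is stateless (no vocabulary to fit), so documents can be vectorized in batches read from disk, and MiniBatchKMeans can update its centers incrementally via `partial_fit`. The `iter_document_batches` generator and the tiny in-memory corpus below are placeholders; in a real setting each batch would be streamed from the 6 GB dataset so it never fully resides in RAM.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

def iter_document_batches():
    # Placeholder: in practice, yield chunks of documents read from disk.
    corpus = [
        ["apples and oranges", "fresh fruit juice", "apple pie recipe"],
        ["linux kernel patch", "compile the kernel", "kernel module driver"],
    ]
    for batch in corpus:
        yield batch

# Stateless vectorizer: hashes tokens into a fixed-size feature space,
# so no in-memory vocabulary is needed.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

km = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)

# First pass: learn cluster centers incrementally.
for batch in iter_document_batches():
    km.partial_fit(vectorizer.transform(batch))

# Second pass: assign a cluster label to every document.
labels = []
for batch in iter_document_batches():
    labels.extend(km.predict(vectorizer.transform(batch)))
print(labels)
```

Alternatively, labels can be assigned during the first pass with `km.predict` on each batch right after `partial_fit`, at the cost of early batches being labeled by not-yet-converged centers.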
Answered By - Fred Foo