Issue
I want to calculate the Calinski-Harabasz index for a large number of datasets. A quick test showed that the intCriteria function from R's clusterCrit package is much slower than the corresponding function in Python's sklearn. Here is the test case (I can share test.tsv if needed).
import numpy as np
import time
from sklearn.cluster import KMeans
# note: older sklearn versions spell this calinski_harabaz_score;
# newer versions rename it to calinski_harabasz_score
from sklearn.metrics import calinski_harabaz_score

# load the data, cluster it, then time only the index computation
d = np.loadtxt('test.tsv', delimiter='\t')
km = KMeans(n_clusters=2, max_iter=10000)
k = km.fit(d)
start = time.time()
ch = calinski_harabaz_score(d, k.labels_)
end = time.time()
print 'CH:', ch, 'time:', (end - start)
Run it (using Python 2.7)
python CH.py
#CH: 482.766811373 time: 0.434059858322
Do the same in R
library(clusterCrit)

# load the data, cluster it, then time only the index computation
d <- as.matrix(read.table('test.tsv', sep='\t'))
k <- kmeans(d, 2, iter.max = 10000, nstart = 10)
start <- Sys.time()
ch <- intCriteria(d, k$cluster, 'Calinski_Harabasz')
end <- Sys.time()
cat('CH:', ch[[1]], 'time:', end - start)
Run it in R (3.4.4)
source('CH.R')
# CH: 482.7726 time: 1.770816
I also tried the calinhara function from the fpc package, but that is also quite slow.
Is there any way to improve the speed of the Calinski-Harabasz computation (and possibly of other cluster validity indices) in R?
Solution
Pure R is often pretty slow because of the interpreter.
To see this, compare the dbscan function from the fpc package with the one from the dbscan package.
If you want an R module to be fast, rewrite the code in Fortran or C.
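As a rough illustration of that route for this particular index, the Calinski-Harabasz loop can be pushed down to C++ via Rcpp. This is only a sketch under a few assumptions: ch_cpp is a made-up name, Rcpp is installed, labels is the 1-based integer vector returned by kmeans, and k is the number of clusters.

library(Rcpp)
cppFunction('
double ch_cpp(NumericMatrix d, IntegerVector labels, int k) {
  int n = d.nrow(), p = d.ncol();
  NumericMatrix centers(k, p);
  NumericVector sizes(k), overall(p);
  // per-cluster sums, cluster sizes and the global sum
  for (int i = 0; i < n; ++i) {
    int g = labels[i] - 1;                 // kmeans labels are 1-based
    sizes[g] += 1;
    for (int j = 0; j < p; ++j) {
      centers(g, j) += d(i, j);
      overall[j] += d(i, j);
    }
  }
  // turn the sums into means
  for (int g = 0; g < k; ++g)
    for (int j = 0; j < p; ++j) centers(g, j) /= sizes[g];
  for (int j = 0; j < p; ++j) overall[j] /= n;
  // between-cluster (B) and within-cluster (W) sums of squares
  double B = 0, W = 0;
  for (int g = 0; g < k; ++g)
    for (int j = 0; j < p; ++j) {
      double diff = centers(g, j) - overall[j];
      B += sizes[g] * diff * diff;
    }
  for (int i = 0; i < n; ++i) {
    int g = labels[i] - 1;
    for (int j = 0; j < p; ++j) {
      double diff = d(i, j) - centers(g, j);
      W += diff * diff;
    }
  }
  return (B / (k - 1)) / (W / (n - k));
}')

Once compiled, ch_cpp(d, k$cluster, 2) should return roughly the same value as intCriteria, without the per-element R-level overhead.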
The same also largely applies to Python (although the Python interpreter appears to be slightly faster than R's). But in many cases the workhorse is numpy code, which is optimized at a low level. In other cases, sklearn uses Cython modules; Cython is a superset of Python that is compiled to C and then to native code.
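The numpy point has an R analogue: even a pure-R implementation of Calinski-Harabasz can be reasonably fast if the heavy lifting is delegated to built-in functions such as rowsum(), colMeans() and rowSums(), which run in compiled code, rather than to explicit R loops. Below is a minimal sketch along those lines; ch_index is a made-up name, and it assumes a numeric matrix d with one point per row and the integer label vector from kmeans.

ch_index <- function(d, labels) {
  n <- nrow(d)
  k <- length(unique(labels))
  # cluster sizes and centroids (rowsum() and colMeans() are C-level)
  sizes   <- as.vector(table(labels))
  centers <- rowsum(d, labels) / sizes          # one centroid per row
  overall <- colMeans(d)                        # global centroid
  # between-cluster sum of squares
  B <- sum(sizes * rowSums(sweep(centers, 2, overall)^2))
  # within-cluster sum of squares: each point vs. its own centroid
  W <- sum((d - centers[as.character(labels), , drop = FALSE])^2)
  (B / (k - 1)) / (W / (n - k))
}

Called as ch_index(d, k$cluster) on the data above, it should give the same value as intCriteria while spending most of its time in compiled routines.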
Answered By - Has QUIT--Anony-Mousse