Issue
I can't understand how n_jobs works:
import sklearn.datasets
import sklearn.cluster

data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)
k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=1).fit(data)
With n_jobs=1, this runs in less than 1 second.
With n_jobs=2, it takes nearly twice as long.
With n_jobs=8, it runs so long it never finished on my computer (I have 8 cores).
Is there something I don't understand about how parallelization works?
Solution
n_jobs specifies the number of concurrent processes or threads to use for parallelized routines.
From the docs:
Some parallelism uses a multi-threading backend by default, some a multi-processing backend. It is possible to override the default backend by using sklearn.utils.parallel_backend.
Because of Python's GIL, more threads do not guarantee better speed. So check whether your backend is configured for threads or processes. If it is threads, try changing it to processes (though you then pay the overhead of inter-process communication); a sketch of overriding the backend follows.
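A minimal sketch of overriding the backend with sklearn.utils.parallel_backend, assuming a scikit-learn version whose KMeans parallelizes its n_init restarts through joblib (newer releases use OpenMP internally instead, so the context manager has little effect there):

import sklearn.datasets
import sklearn.cluster
from sklearn.utils import parallel_backend

data, labels = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)

# Force a process-based backend ('loky') for the parallel sections of fit;
# passing 'threading' here would select the thread-based backend instead.
with parallel_backend('loky', n_jobs=2):
    k_means = sklearn.cluster.KMeans(n_clusters=10, max_iter=3).fit(data)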
Again from the docs:
Whether parallel processing is helpful at improving runtime depends on many factors, and it’s usually a good idea to experiment rather than assuming that increasing the number of jobs is always a good thing. It can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel.
So n_jobs is not a silverium bullet; one has to experiment to see whether it helps for their estimator and kind of data.
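A minimal timing sketch of that experiment, assuming a scikit-learn version older than 1.0 where KMeans still accepts n_jobs, as in the question:

import time
import sklearn.datasets
import sklearn.cluster

data, _ = sklearn.datasets.make_blobs(n_samples=1000, n_features=416, centers=20)

# Time the same fit under increasing n_jobs instead of assuming more is faster.
for n_jobs in (1, 2, 4, 8):
    start = time.perf_counter()
    sklearn.cluster.KMeans(n_clusters=10, max_iter=3, n_jobs=n_jobs).fit(data)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")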
Answered By - mujjiga