Issue
I've observed that setting a random seed before using multiprocessing in Python causes strange behaviour.
In Python 3.5.2, only 2 or 3 cores are used, each at a low CPU percentage. In Python 2.7.13, all requested cores run at 100%, but the code never seems to finish. When I remove the initialization of the random seed, the parallelization works fine.
This happens even though the parallelized function makes no explicit use of random. I now assume the seed is shared among processes and that this prevents multiprocessing from running smoothly, but can someone provide the correct answer?
I've run the code on Linux and here is a minimal code example :
from multiprocessing import Pool
import numpy as np
import random
random.seed = 2018
NB_CPUS = 4
def test(x):
    return x**2
pool = Pool(NB_CPUS)
args = [np.random.rand() for _ in range(100000)]
results = pool.map(test, args)
pool.terminate()
results[-5:]
Solution
A bit late with an answer, but you're breaking things by setting the random.seed function to an int. You should instead be doing:
random.seed(2018)
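To see the difference, here's a minimal sketch contrasting calling random.seed with rebinding it (the variable names are illustrative):

```python
import random

# Correct: call the seed function with a seed value
random.seed(2018)
first = random.random()
random.seed(2018)
assert random.random() == first  # re-seeding reproduces the stream

# The bug from the question: assignment rebinds the name, it does not seed
saved = random.seed               # keep a reference so we can restore it
random.seed = 2018
assert not callable(random.seed)  # random.seed is now just an int
try:
    random.seed()                 # this is what multiprocessing calls on fork
except TypeError as exc:
    print(exc)                    # 'int' object is not callable
random.seed = saved               # restore the real function
```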
The last few lines of the traceback provide the context that should have made this obvious:
  File "/usr/lib64/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "/usr/lib64/python2.7/multiprocessing/forking.py", line 125, in __init__
    random.seed()
TypeError: 'int' object is not callable
This causes Pool to keep trying to create new worker processes, but because the same error happens every time, no forward progress can be made.
The reason behind this is that multiprocessing knows it should re-seed the random module when forking, so that child processes don't share the same RNG state. To do this it tries to call the random.seed function, but you've set it to an int, which isn't callable; hence the error!
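A quick sketch to confirm the re-seeding behaviour (assuming the default fork start method on Linux; the function name draw is illustrative). With random.seed left intact, each worker draws from its own freshly seeded state:

```python
from multiprocessing import Pool
import random

random.seed(2018)  # seeding correctly doesn't interfere with Pool

def draw(_):
    # multiprocessing re-seeded this worker's `random` module on fork,
    # so workers don't all replay the parent's stream
    return random.random()

if __name__ == "__main__":
    with Pool(4) as pool:
        values = pool.map(draw, range(4))
    print(values)  # almost surely four distinct values
```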
Another related issue is that multiprocessing doesn't know to re-seed the NumPy RNG, so the following code:
from multiprocessing import Pool
import numpy as np
def test(i):
    print(i, np.random.rand())

with Pool(4) as pool:
    pool.map(test, range(4))
will cause each worker to print the same value. This issue has been known for a while, but is still open. You can work around it by using a worker initializer, e.g.:
def initfn():
    np.random.seed()

with Pool(4, initializer=initfn) as pool:
    pool.map(test, range(4))
will now cause the above test function to print different values. Note that you could even use Pool(4, initializer=np.random.seed) if you're not doing any other worker-level initialization.
Answered By - Sam Mason