Issue
I have a very large matrix (over 100k by 100K) with a calculation logic whereby each row can be calculated distinct from other rows
I want to use multiprocessing to optimize compute time (with the matrix split into 3 slices of 1/3 rows each). However it seems like multiprocessing takes longer than a single call to calculate all rows. I am changing different parts of the matrix in each process- is that the issue?
import multiprocessing, os
import time, pandas as pd, numpy as np
def mat_proc(df):
print("ID of process running worker1: {}".format(os.getpid()))
return(df+3) # simplified version of process
print('done processing')
count=5000
df = pd.DataFrame(np.random.randint(0,10,size=(3*count,3*count)),dtype='int8')
slice1=df.iloc[0:count,]
slice2=df.iloc[count:2*count,]
slice3=df.iloc[2*count:3*count,]
p1=multiprocessing.Process(target=mat_proc,args=(slice1,))
p2=multiprocessing.Process(target=mat_proc,args=(slice2,))
p3=multiprocessing.Process(target=mat_proc,args=(slice3,))
start=time.time()
print('started now')
# this is to compare the multiprocess with a single call to full matrix
#mat_proc(df)
if __name__ == '__main__':
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
finish=time.time()
print(f'total time taken {round(finish-start,2)}')
Solution
When using multiprocessing move all script parts to if __name__ == '__main__'
part. Because when each process spawns it runs your main script. So each process had to recreate dataframe, slicing, etc.
import multiprocessing, os
import time, pandas as pd, numpy as np
def mat_proc(df):
print("ID of process running worker1: {}".format(os.getpid()))
return (df + 3) # simplified version of process
print('done processing')
if __name__ == '__main__':
count = 5000
df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
slice1 = df.iloc[0:count, ]
slice2 = df.iloc[count:2 * count, ]
slice3 = df.iloc[2 * count:3 * count, ]
p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))
start = time.time()
print('started now')
# this is to compare the multiprocess with a single call to full matrix
# mat_proc(df)
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
finish = time.time()
print(f'total time taken {round(finish - start, 2)}')
And consider using multiprocessing.Pool
, it can be handy to be able to choose how many processes you want to spawn by changing single number.
Second thing, if computations are easy (as in the simplified version of process you provieded) spawning processes, sending data to it (pickling and unpickling dataframe) will take longer than those computations and multiprocessing will be slower.
Answered By - dankal444
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.