Wednesday, December 1, 2021

[FIXED] Multiprocessing different rows of matrix

December 01, 2021 dataframe, multiprocessing, pandas, python No comments

Issue

I have a very large matrix (over 100k by 100K) with a calculation logic whereby each row can be calculated distinct from other rows

I want to use multiprocessing to optimize compute time (with the matrix split into 3 slices of 1/3 rows each). However it seems like multiprocessing takes longer than a single call to calculate all rows. I am changing different parts of the matrix in each process- is that the issue?

import multiprocessing, os
import time, pandas as pd, numpy as np

def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    return(df+3)  # simplified version of process  
    print('done processing')
          
count=5000

df = pd.DataFrame(np.random.randint(0,10,size=(3*count,3*count)),dtype='int8')
slice1=df.iloc[0:count,]
slice2=df.iloc[count:2*count,]
slice3=df.iloc[2*count:3*count,]

p1=multiprocessing.Process(target=mat_proc,args=(slice1,))
p2=multiprocessing.Process(target=mat_proc,args=(slice2,))
p3=multiprocessing.Process(target=mat_proc,args=(slice3,))

start=time.time()
print('started now')
# this is to compare the multiprocess with a single call to full matrix
#mat_proc(df)

if __name__ == '__main__':   
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()
    
finish=time.time()
print(f'total time taken {round(finish-start,2)}')

Solution

When using multiprocessing move all script parts to if __name__ == '__main__' part. Because when each process spawns it runs your main script. So each process had to recreate dataframe, slicing, etc.

import multiprocessing, os
import time, pandas as pd, numpy as np


def mat_proc(df):
    print("ID of process running worker1: {}".format(os.getpid()))
    return (df + 3)  # simplified version of process
    print('done processing')


if __name__ == '__main__':
    count = 5000

    df = pd.DataFrame(np.random.randint(0, 10, size=(3 * count, 3 * count)), dtype='int8')
    slice1 = df.iloc[0:count, ]
    slice2 = df.iloc[count:2 * count, ]
    slice3 = df.iloc[2 * count:3 * count, ]

    p1 = multiprocessing.Process(target=mat_proc, args=(slice1,))
    p2 = multiprocessing.Process(target=mat_proc, args=(slice2,))
    p3 = multiprocessing.Process(target=mat_proc, args=(slice3,))

    start = time.time()
    print('started now')
    # this is to compare the multiprocess with a single call to full matrix
    # mat_proc(df)

    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()

    finish = time.time()
    print(f'total time taken {round(finish - start, 2)}')

And consider using multiprocessing.Pool, it can be handy to be able to choose how many processes you want to spawn by changing single number.

Second thing, if computations are easy (as in the simplified version of process you provieded) spawning processes, sending data to it (pickling and unpickling dataframe) will take longer than those computations and multiprocessing will be slower.

Answered By - dankal444

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, December 1, 2021

[FIXED] Multiprocessing different rows of matrix

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels