Issue
I work with well log data, where different values are associated with different depths. Very often I need to iterate over the depths, perform calculations, and store the results in a previously created array to build new well logs. It is equivalent to the example below.
import numpy as np

depth = np.linspace(5000, 6000, 2001)
y1 = np.random.random((len(depth), 3))
y2 = np.random.random(len(depth))

def fun_1(y1, y2):
    return y1 + y2

def fun_2(y1, y2):
    return sum(y1 * y2)

result_1 = np.zeros_like(y1)
result_2 = np.zeros_like(y2)

for i in range(len(depth)):
    prov_result_1 = fun_1(y1[i], y2[i])
    prov_result_2 = fun_2(y1[i], y2[i])
    result_1[i, :] = prov_result_1
    result_2[i] = prov_result_2
In the example, I go depth by depth, using the y1 and y2 values to calculate the provisional values prov_result_1 and prov_result_2, which I then store at the corresponding index of result_1 and result_2, so that the final result_1 and result_2 curves are complete when the loop ends. It's easy to see that, depending on the size of the depth array and the functions applied at each depth, things can get out of hand and this code can take several hours to complete. Keep in mind that the y1 and y2 arrays can be much larger and the functions I apply are much more complex than these.
I would like to know if there is a way to use Python's multiprocessing library to parallelize this for loop. Other answers I've found here on StackOverflow don't directly translate to this problem and always seem to be much more complicated than they should be.
One option I imagine would be to divide the depth by the number of processors and do this same for loop in parallel, something like:
num_pool = 2
depth_cut = (depth[-1] - depth[0]) / num_pool
depth_parallel = [depth[depth <= depth[0] + depth_cut],
                  depth[depth > depth[0] + depth_cut]]
y1_parallel = [y1[depth <= depth[0] + depth_cut],
               y1[depth > depth[0] + depth_cut]]
y2_parallel = [y2[depth <= depth[0] + depth_cut],
               y2[depth > depth[0] + depth_cut]]
Only more generic. I would then send the pieces of data to parallel processes, perform my calculations, and concatenate everything back together at the end.
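A more generic version of that split might look something like the sketch below (I'm guessing np.array_split is a reasonable way to do it, but I'm not sure this is the right direction):

# Rough sketch only: split each array into num_pool chunks of (nearly) equal size;
# np.array_split preserves order, so the pieces can be concatenated back afterwards.
num_pool = 2
depth_parallel = np.array_split(depth, num_pool)
y1_parallel = np.array_split(y1, num_pool)
y2_parallel = np.array_split(y2, num_pool)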
Solution
You could then:
- use the multiprocessing library,
- create two parallelized functions to deal with the chunks separately.
The general idea is to create a separate process for each function, and to deal with the chunks one by one inside each function-process.
To my mind, this is a simple way to keep control over the different processing speeds of the two functions.
Proposed code
import numpy as np
from multiprocessing import Pool

def fun_1(args):
    y1_i, y2_i = args
    return y1_i + y2_i

def fun_2(args):
    y1_i, y2_i = args
    return sum(y1_i * y2_i)

def separate_process(func, y1, y2, num_pool):
    """Separated process to deal with chunks, one function at a time"""
    # Create a pool of workers
    pool = Pool(num_pool)
    # Divide the data into chunks for treatment
    # (good thing to do with large datasets)
    ### Assuming len(y1) == len(y2)
    chunk_size = len(y1) // num_pool
    chunks_y1 = [y1[i:i + chunk_size] for i in range(0, len(y1), chunk_size)]
    chunks_y2 = [y2[i:i + chunk_size] for i in range(0, len(y2), chunk_size)]
    # Apply the function to each chunk in parallel
    results = pool.map(func, zip(chunks_y1, chunks_y2))
    # Avoid 'float' exception
    results = [result.tolist() if isinstance(result, np.ndarray) else [result] for result in results]
    # Close the pool
    pool.close()
    # Wait for the processes to finish to keep control
    pool.join()
    # Concatenation: the chunking strategy is not supposed to be visible at the end
    return results
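# Note for Windows/macOS users: these platforms spawn (rather than fork) new processes,
# so the top-level code below should be wrapped in an `if __name__ == '__main__':` guard.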
depth = np.linspace(5000, 6000, 2001)
y1 = np.random.random(len(depth))
y2 = np.random.random(len(depth))
num_pool = 2
fun_1_res = separate_process(fun_1, y1, y2, num_pool)
fun_2_res = separate_process(fun_2, y1, y2, num_pool)
print(len(fun_1_res))
### 3
print(len(fun_2_res))
### 3
Chunk results concatenation:
def concatenate_chunks(fun_x_res):
    return sum(fun_x_res, [])

fun_1_res = concatenate_chunks(fun_1_res)

# Final length (supposed to be 2001)
print(len(fun_1_res))
### 2001
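As a quick sanity check (assuming the arrays from the proposed code above are still in scope), the concatenated result should match the plain vectorized computation:

# The concatenated chunk results should equal y1 + y2 element by element
print(np.allclose(fun_1_res, y1 + y2))
### True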
Note:
We could widen the perspective and compute fun_1_res and fun_2_res in two separate processes too (see the sketch below). In that case you would have: 2 main processes + 2 x 2 sub-processes (in separate_process) = a total of 6 processes. A powerful computing environment is required for this.
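A rough sketch of that widened setup, reusing the definitions from the proposed code above (the run_in_process helper and the queue-based result passing are illustrative additions, not part of the original answer):

from multiprocessing import Process, Queue

def run_in_process(func, y1, y2, num_pool, queue, key):
    # Run separate_process in its own (non-daemonic) process, which is allowed
    # to create its own Pool, and send the result back through the queue.
    queue.put((key, separate_process(func, y1, y2, num_pool)))

queue = Queue()
p1 = Process(target=run_in_process, args=(fun_1, y1, y2, num_pool, queue, 'fun_1'))
p2 = Process(target=run_in_process, args=(fun_2, y1, y2, num_pool, queue, 'fun_2'))
p1.start()
p2.start()
# Drain the queue before joining to avoid blocking on large results
results = dict(queue.get() for _ in range(2))
p1.join()
p2.join()
fun_1_res = concatenate_chunks(results['fun_1'])
fun_2_res = concatenate_chunks(results['fun_2'])

This keeps the total at the 6 processes mentioned above: 2 outer processes plus 2 x 2 pool workers.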
Answered By - Laurent B.