Issue
I have 200 pairs of paths to diff. I wrote a little function that will diff each pair and update a dictionary, which is itself one of the arguments to the function. Assume MY_DIFFER is some diffing tool I am calling via subprocess under the hood.
async def do_diff(path1, path2, result):
    result[f"{path1} {path2}"] = MY_DIFFER(path1, path2)
As you can see, I have nothing to return from this async function. I am just capturing the result in result.
I call this function in parallel elsewhere using asyncio like so:
path_tuples = [("/path11", "/path12"), ("/path21", "/path22"), ... ]
result = {}
loop = asyncio.get_event_loop()
loop.run_until_complete(
    asyncio.gather(
        *(do_diff(path1, path2, result) for path1, path2 in path_tuples)
    )
)
Questions:
- I don't know where to put await in the do_diff function. But the code seems to work without it as well.
- I am not sure if the diffs are really happening in parallel, because when I look at the output of ps -eaf in another terminal, I see only one instance of the underlying tool I am calling at a time.
- The speed of execution is the same as when I was doing the diffs sequentially.
So I am clearly doing something wrong. How can I REALLY do the diffs in parallel?
PS: I am on Python 3.6
Solution
Remember that asyncio doesn't run things in parallel; it runs things concurrently, using a cooperative multitasking model, which means that coroutines need to explicitly yield time to other coroutines for them to run. This is what the await keyword does; it says "go run some other coroutines while I'm waiting for something to finish".
If you're never awaiting on something, you're not getting concurrent execution.
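To make the difference concrete, here is a minimal sketch (not part of the original question) that runs two coroutines through asyncio.gather twice: once with a blocking time.sleep, once with an awaited asyncio.sleep. The names blocking_job, cooperative_job and run_pair are made up for illustration, and asyncio.new_event_loop is used only so that the snippet runs on Python 3.6 as well as on newer versions.

```python
import asyncio
import time


async def blocking_job(name, log):
    # time.sleep() never yields to the event loop, so no other coroutine can run.
    log.append(f"{name} start")
    time.sleep(0.1)
    log.append(f"{name} end")


async def cooperative_job(name, log):
    # "await asyncio.sleep()" hands control back to the loop while waiting.
    log.append(f"{name} start")
    await asyncio.sleep(0.1)
    log.append(f"{name} end")


async def run_pair(job):
    # Run two instances of the same job through gather and record the order of events.
    log = []
    await asyncio.gather(job("a", log), job("b", log))
    return log


loop = asyncio.new_event_loop()
try:
    blocking_log = loop.run_until_complete(run_pair(blocking_job))
    cooperative_log = loop.run_until_complete(run_pair(cooperative_job))
finally:
    loop.close()

print(blocking_log)     # ['a start', 'a end', 'b start', 'b end']
print(cooperative_log)  # ['a start', 'b start', 'a end', 'b end']
```

In the first run, "b" cannot even start until "a" has finished, which matches the one-process-at-a-time behaviour described in the question.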
What you want is for your do_diff coroutine to be able to await the execution of your external tool, but you can't do that with just the subprocess module, because its calls block until the tool exits. You can do that using the run_in_executor method, which arranges to run a synchronous function (e.g., subprocess.run) in a separate thread or process and wait asynchronously for the result. That might look something like:
async def do_diff(path1, path2, result):
    loop = asyncio.get_event_loop()
    result[f"{path1} {path2}"] = await loop.run_in_executor(None, MY_DIFFER, path1, path2)
This will by default run MY_DIFFER in a separate thread, although you can use a separate process instead by passing an explicit executor as the first argument to run_in_executor.
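As a rough end-to-end sketch of passing an explicit executor, the snippet below creates a ThreadPoolExecutor with a chosen number of workers and hands it to run_in_executor. Several things here are assumptions for illustration only: my_differ is a stand-in that just sleeps instead of calling a real diff tool, diff_all and the extra executor parameter are hypothetical, and the pool size of 8 is arbitrary. A concurrent.futures.ProcessPoolExecutor could be passed in the same position if separate processes are preferred.

```python
import asyncio
import concurrent.futures
import time


def my_differ(path1, path2):
    # Hypothetical stand-in for the real diff tool: it only blocks for a moment.
    time.sleep(0.2)
    return f"diff of {path1} and {path2}"


async def do_diff(path1, path2, result, executor):
    # Inside a coroutine, get_event_loop() returns the loop that is running it.
    loop = asyncio.get_event_loop()
    result[f"{path1} {path2}"] = await loop.run_in_executor(
        executor, my_differ, path1, path2
    )


async def diff_all(path_tuples, max_workers):
    result = {}
    # The explicit executor controls how many diffs may run at the same time.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        await asyncio.gather(
            *(do_diff(p1, p2, result, executor) for p1, p2 in path_tuples)
        )
    return result


path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(8)]
loop = asyncio.new_event_loop()
try:
    started = time.monotonic()
    result = loop.run_until_complete(diff_all(path_tuples, max_workers=8))
    elapsed = time.monotonic() - started
finally:
    loop.close()

print(len(result), f"{elapsed:.2f}s")
```

With eight workers the eight 0.2-second jobs should overlap, so the total should come out much closer to 0.2 seconds than to the 1.6 seconds a sequential run would need.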
Per my comment, solving this with concurrent.futures might look something like this:
import concurrent.futures
import time

# dummy function that just sleeps for 2 seconds
# replace this with your actual code
def do_diff(path1, path2):
    print(f"diffing path {path1} and {path2}")
    time.sleep(2)
    return path1, path2, "information about diff"

# create 200 path tuples for demonstration purposes
path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(200)]

futures = []
with concurrent.futures.ProcessPoolExecutor(max_workers=100) as executor:
    for path1, path2 in path_tuples:
        # submit the job to the executor
        futures.append(executor.submit(do_diff, path1, path2))

    # read the results
    for future in futures:
        print(future.result())
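A possible variation on the same idea, offered as a sketch and not as part of the original answer: collect the results into a dictionary keyed by the path pair, as the question does, and handle each future as soon as it finishes using concurrent.futures.as_completed. This version swaps in ThreadPoolExecutor, on the assumption that the real work happens in an external process, so threads that merely wait on it are enough. The do_diff body is still a placeholder, and the pool size of 10 is arbitrary.

```python
import concurrent.futures
import time


def do_diff(path1, path2):
    # Placeholder for the real work; replace with the actual diff call.
    time.sleep(0.1)
    return "information about diff"


path_tuples = [(f"/path{x}.1", f"/path{x}.2") for x in range(10)]

result = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Map each future back to the pair of paths it was submitted for.
    future_to_paths = {
        executor.submit(do_diff, p1, p2): (p1, p2) for p1, p2 in path_tuples
    }
    # as_completed yields futures in the order they finish, not the order submitted.
    for future in concurrent.futures.as_completed(future_to_paths):
        path1, path2 = future_to_paths[future]
        result[f"{path1} {path2}"] = future.result()

print(len(result))  # 10
```

Because the dictionary is filled in by the main thread only, no locking is needed around result.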
Answered By - larsks