Issue
I am trying to parallelize thousands of downloads in Python. Each download takes 2-3 seconds. I have looked at multi-threading vs. multi-processing, and from what I have read, multi-threading seems better suited to I/O-bound work like this.
I have a Python list of URLs, and I use this loop to download them all:
import os
import urllib.request

for k in range(len(urls)):
    id_sep = urls[k].rpartition('/')            # split off the last path segment
    path = 'DownloadFolder/' + id_sep[2] + '.pdf'
    if not os.path.exists(path):                # skip files already on disk
        urllib.request.urlretrieve(urls[k], path)
I am wondering what the optimal method is to run these downloads in parallel.
Another consideration is the optimal number of concurrent downloads. Does this depend on the number of cores? My system has two, according to this command:
import multiprocessing
multiprocessing.cpu_count()
Does that mean that the optimal number of simultaneous downloads is two? If so, how do I run only two downloads at a time and queue the rest of the iterations?
Solution
Downloading is not a compute-bound process, so core count is unlikely to drive your parallelism. Rather, the limit will be your network bandwidth (or your share of it). We don't know your network configuration and physical characteristics, so there's not much we can predict.
However, the fastest path to a solution is to run some short, empirical tests. Scale your parallelism by 3x or 4x on each run; you'll likely find the "sweet spot" quickly. You can also try switching between processes and threads, but that shouldn't be the limiting factor -- the limit should be the network response balanced against your bandwidth.
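As a concrete starting point, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor, which caps the number of simultaneous downloads and automatically queues the rest. It assumes the urls list and DownloadFolder directory from the question; the download helper and the max_workers value of 16 are illustrative choices, and max_workers is the knob to vary between test runs.

import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # Derive the target filename from the last path segment of the URL
    path = 'DownloadFolder/' + url.rpartition('/')[2] + '.pdf'
    if not os.path.exists(path):    # skip files already on disk
        urllib.request.urlretrieve(url, path)
    return path

# max_workers caps concurrency; the remaining URLs wait in the pool's queue.
# Try values such as 4, 16, 64 on successive runs to find the sweet spot.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(download, urls))

With a thread pool, only max_workers downloads run at once and the rest of the URLs sit in the executor's internal queue, which also answers the "queue the rest of the iterations" part of the question.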
Answered By - Prune