Issue
I am trying to download 12,000 files from an S3 bucket in a Jupyter notebook, and the download is estimated to take 21 hours because the files are downloaded one at a time. Can I run multiple downloads in parallel to speed up the process?
Currently, I am using the following code to download all the files:
### Get unique full-resolution image basenames
images = df['full_resolution_image_basename'].unique()
print(f'No. of unique full-resolution images: {len(images)}')

### Create a folder for full-resolution images
images_dir = './images/'
os.makedirs(images_dir, exist_ok=True)

### Download images
images_str = "','".join(images)
limiting_clause = f"CONTAINS(ARRAY['{images_str}'], full_resolution_image_basename)"
_ = download_full_resolution_images(images_dir, limiting_clause=limiting_clause)
Solution
See the code below. It requires Python 3.6+ because it uses f-strings (PEP 498); use a different method of string formatting for older versions of Python.
Provide relative_path, bucket_name and s3_object_keys. max_workers is optional; if it is None, ThreadPoolExecutor chooses a default based on the number of processors: five times the processor count on Python 3.5 to 3.7, and min(32, os.cpu_count() + 4) from Python 3.8.
Most of the code for this answer came from an answer to "How to create an async generator in Python?", which in turn is based on an example in the library documentation.
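As an aside on the max_workers default, here is a minimal sketch of the two formulas given in the concurrent.futures documentation. The helper names are hypothetical and exist only for illustration; the executor computes the value internally when max_workers is None.

```python
import os


def default_workers_py35_to_37(cpu_count):
    """Documented default for Python 3.5 to 3.7: processors times 5."""
    return cpu_count * 5


def default_workers_py38_plus(cpu_count):
    """Documented default from Python 3.8: processors plus 4, capped at 32."""
    return min(32, cpu_count + 4)


# os.cpu_count() can return None, so fall back to 1 in that case
cpus = os.cpu_count() or 1
print(f'processors: {cpus}')
print(f'default on 3.5 to 3.7: {default_workers_py35_to_37(cpus)}')
print(f'default on 3.8+: {default_workers_py38_plus(cpus)}')
```

For I/O-bound work such as S3 downloads, a value larger than the processor count is usually reasonable, but the best number depends on bandwidth and on any request limits, so it is worth measuring.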
import os
from concurrent import futures

import boto3

relative_path = './images'
bucket_name = 'bucket_name'
s3_object_keys = []  # List of S3 object keys
max_workers = 5  # Set to None to use the ThreadPoolExecutor default

abs_path = os.path.abspath(relative_path)
s3 = boto3.client('s3')


def fetch(key):
    """Download one object and return the local file path."""
    file = f'{abs_path}/{key}'
    # Create the parent directory, not a directory named after the file
    os.makedirs(os.path.dirname(file), exist_ok=True)
    with open(file, 'wb') as data:
        s3.download_fileobj(bucket_name, key, data)
    return file


def fetch_all(keys):
    """Yield (key, result) as downloads finish; result is a path or an exception."""
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_key = {executor.submit(fetch, key): key for key in keys}
        print("All keys submitted.")
        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()
            if not exception:
                yield key, future.result()
            else:
                yield key, exception


for key, result in fetch_all(s3_object_keys):
    print(f'key: {key} result: {result}')
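To try the generator pattern above without AWS credentials, the following sketch swaps the S3 download for a stand-in function. fake_fetch, its failure rule and the /tmp/images path are assumptions made up for this illustration; only the ThreadPoolExecutor and as_completed usage mirrors the answer.

```python
import time
from concurrent import futures


def fake_fetch(key):
    """Hypothetical stand-in for an S3 download: sleeps, then fails for 'bad' keys."""
    time.sleep(0.01)
    if key.startswith('bad'):
        raise ValueError(f'cannot fetch {key}')
    return f'/tmp/images/{key}'


def fetch_all(keys, worker, max_workers=5):
    """Run worker(key) on a thread pool and yield (key, result or exception)."""
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_key = {executor.submit(worker, key): key for key in keys}
        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()
            yield key, (exception if exception else future.result())


results = dict(fetch_all(['a.png', 'b.png', 'bad.png'], fake_fetch))
failed = sorted(k for k, v in results.items() if isinstance(v, Exception))
print(f'{len(results)} keys processed, failed: {failed}')
```

Because a failure is yielded rather than raised, one bad key does not stop the remaining downloads, and the failed keys can be collected and retried afterwards.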
Answered By - Diego Goding