Friday, June 24, 2022

[FIXED] Is there any faster way for downloading multiple files from s3 to local folder?

June 24, 2022 amazon-s3, amazon-web-services, boto3, jupyter-notebook, python-3.6 No comments

Issue

I am trying to download 12,000 files from s3 bucket using jupyter notebook, which is estimating to complete download in 21 hours. This is because each file is downloaded one at a time. Can we do multiple downloads parallel to each other so I can speed up the process?

Currently, I am using the following code to download all files

### Get unique full-resolution image basenames
images = df['full_resolution_image_basename'].unique()
print(f'No. of unique full-resolution images: {len(images)}')

### Create a folder for full-resolution images
images_dir = './images/'
os.makedirs(images_dir, exist_ok=True)

### Download images
images_str = "','".join(images)
limiting_clause = f"CONTAINS(ARRAY['{images_str}'], 
full_resolution_image_basename)"
_ = download_full_resolution_images(images_dir, 
limiting_clause=limiting_clause)

Solution

See the code below. This will only work with python 3.6+, because of the f-string (PEP 498). Use a different method of string formatting for older versions of python.

Provide the relative_path, bucket_name and s3_object_keys. In addition, max_workers is optional, and if not provided the number will be a multiple of 5 times the number of machine processors.

Most of the code for this answer came from an answer to How to create an async generator in Python? which sources from this example documented in the library.

import boto3
import os
from concurrent import futures


relative_path = './images'
bucket_name = 'bucket_name'
s3_object_keys = [] # List of S3 object keys
max_workers = 5

abs_path = os.path.abspath(relative_path)
s3 = boto3.client('s3')

def fetch(key):
    file = f'{abs_path}/{key}'
    os.makedirs(file, exist_ok=True)  
    with open(file, 'wb') as data:
        s3.download_fileobj(bucket_name, key, data)
    return file


def fetch_all(keys):

    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_key = {executor.submit(fetch, key): key for key in keys}

        print("All URLs submitted.")

        for future in futures.as_completed(future_to_key):

            key = future_to_key[future]
            exception = future.exception()

            if not exception:
                yield key, future.result()
            else:
                yield key, exception


for key, result in fetch_all(S3_OBJECT_KEYS):
    print(f'key: {key}  result: {result}')

Answered By - Diego Goding

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, June 24, 2022

[FIXED] Is there any faster way for downloading multiple files from s3 to local folder?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels