Issue
I have recently started using GCP for my project and ran into difficulties working with a bucket from a Jupyter notebook on a Dataproc cluster. At the moment I have a bucket with a bunch of files in it and a Dataproc cluster running a Jupyter notebook. What I am trying to do is go over all the files in the bucket and extract the data from them to create a dataframe.
I can access one file at a time with the following code:

data = spark.read.csv('gs://BUCKET_NAME/PATH/FILENAME.csv')

but there are hundreds of files, and I cannot write a line of code for each of them. Usually, I would do something like this:
import os
for filename in os.listdir(directory):
...
but this does not seem to work here. So, I was wondering: how do I iterate over the files in a bucket from a Jupyter notebook on a Dataproc cluster?
Would appreciate any help!
Solution
You can list the objects (blobs) in your bucket with the following code:
from google.cloud import storage

# Create a Storage client using the cluster's default credentials
client = storage.Client()

BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# list_blobs() returns an iterator over every object in the bucket
blobs = bucket.list_blobs()
files = [blob.name for blob in blobs]
If there are no folders in your bucket, the list called files will contain the names of the files.
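Once you have the object names, you can feed them back into Spark to build a single dataframe. Below is a minimal sketch, assuming the objects you want are CSV files (folder placeholders end in '/', so the .csv filter skips them); the variable names and the header/inferSchema options are illustrative and not part of the original answer:

from google.cloud import storage

# 'spark' is the SparkSession that the Dataproc Jupyter kernel already provides
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# Build full gs:// paths, keeping only CSV objects
paths = [
    f'gs://{BUCKET_NAME}/{blob.name}'
    for blob in bucket.list_blobs()
    if blob.name.endswith('.csv')
]

# spark.read.csv accepts a list of paths, so every file is read into one dataframe
df = spark.read.csv(paths, header=True, inferSchema=True)

If your files sit under a common prefix, bucket.list_blobs(prefix='PATH/') narrows the listing, and Spark also accepts glob patterns such as 'gs://BUCKET_NAME/PATH/*.csv' if you want to skip the listing step entirely.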
Answered By - Javier A