Issue
I have recently started using GCP for my project and ran into difficulties working with a bucket from a Jupyter notebook on a Dataproc cluster. At the moment I have a bucket with a bunch of files in it and a Dataproc cluster running a Jupyter notebook. What I am trying to do is go over all the files in the bucket and extract the data from them to create a dataframe.
I can access one file at a time with the following code:

data = spark.read.csv('gs://BUCKET_NAME/PATH/FILENAME.csv')

but there are hundreds of files, and I cannot write a line of code for each of them. Usually, I would do something like this:
import os
for filename in os.listdir(directory):
...
but this does not seem to work here. So, I was wondering: how do I iterate over the files in a bucket from a Jupyter notebook on a Dataproc cluster?
Would appreciate any help!
Solution
You can list the objects (blobs) in your bucket with the following code:
from google.cloud import storage

# Create a Storage client using the cluster's default credentials
client = storage.Client()

BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# list_blobs() returns an iterator over every object in the bucket
blobs = bucket.list_blobs()
files = [blob.name for blob in blobs]
If there are no folders in your bucket, the list called files will contain the names of the files.
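Once you have the object names, you can feed them back into Spark to build a single dataframe. Below is a minimal sketch, assuming the objects you want are CSV files (folder placeholders end in '/', so the .csv filter skips them); the variable names and the header/inferSchema options are illustrative and not part of the original answer:

from google.cloud import storage

# 'spark' is the SparkSession that the Dataproc Jupyter kernel already provides
client = storage.Client()
BUCKET_NAME = 'your_bucket_name'
bucket = client.get_bucket(BUCKET_NAME)

# Build full gs:// paths, keeping only CSV objects
paths = [
    f'gs://{BUCKET_NAME}/{blob.name}'
    for blob in bucket.list_blobs()
    if blob.name.endswith('.csv')
]

# spark.read.csv accepts a list of paths, so every file is read into one dataframe
df = spark.read.csv(paths, header=True, inferSchema=True)

If your files sit under a common prefix, bucket.list_blobs(prefix='PATH/') narrows the listing, and Spark also accepts glob patterns such as 'gs://BUCKET_NAME/PATH/*.csv' if you want to skip the listing step entirely.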
Answered By - Javier A