Issue
I am trying to read a tab-separated-value txt file in Python that I extracted from AWS storage (AWS credentials censored with XXX).
import io
import pandas as pd
import boto3
import csv
from bioservices import UniProt
from sqlalchemy import create_engine
s3 = boto3.resource(
    service_name='s3',
    region_name='us-east-2',
    aws_access_key_id='XXX',
    aws_secret_access_key='XXX'
)
That is simply for connecting to AWS. Next, when I run this code to read a tab-separated txt file that is stored in AWS:
txt = s3.Bucket('compound-bioactivity-original-files').Object('helper-files/kinhub_human_kinase_list_30092021.txt').get()
txt_reader = csv.reader(txt, delimiter='\t')
for line in txt_reader:
    print(line)
I get this output, which is not what I am looking for. Using dialect='excel-tab' instead of delimiter='\t' gives me the same output as well:
['ResponseMetadata']
['AcceptRanges']
['LastModified']
['ContentLength']
['ETag']
['VersionId']
['ContentType']
['Metadata']
['Body']
Solution
There are several issues with your code.
First, Object.get() does not return the contents of the Amazon S3 object. Instead, as per the Object.get() documentation, it returns:
{
    'Body': StreamingBody(),
    'AcceptRanges': 'string',
    'LastModified': datetime(2015, 1, 1),
    'ContentLength': 123,
    'ETag': 'string',
    'VersionId': 'string',
    'CacheControl': 'string',
    'ContentDisposition': 'string',
    ...
    'BucketKeyEnabled': True|False,
    'TagCount': 123,
}
You can see this happening by inserting print(txt) as a debugging line.
If you want to access the contents of the object, you would use the Body element. To retrieve the contents of the streaming body, you can use .read().
However, .read() returns bytes, since the object is treated as a binary file. In Python, you can convert the bytes back to a string by using .decode('ascii'). See: How to convert 'binary string' to normal string in Python3?
Therefore, you would actually need to use:
txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')
(If that seems too complex, you could simply have downloaded the file to the local disk, then used the CSV reader on the local file -- it would have worked nicely without the get/read/decode chain.)
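The download-to-disk alternative might be sketched like this. The bucket name, key, and file paths are placeholders, and the helper function names are my own, not part of boto3:

```python
import csv


def read_tsv(local_path):
    """Parse a local tab-separated file with the standard CSV reader."""
    with open(local_path, newline='') as f:
        return list(csv.reader(f, delimiter='\t'))


def download_and_read(bucket, key, local_path):
    """Download an S3 object to a local file, then parse that file."""
    import boto3  # imported here so read_tsv() works without boto3 installed
    boto3.resource('s3').Bucket(bucket).Object(key).download_file(local_path)
    return read_tsv(local_path)
```

Because download_file() writes a real file to disk, csv.reader operates on an ordinary file object and no manual read/decode step is needed.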
The next issue is that the documentation for csv.reader says:
csv.reader(csvfile, dialect='excel', **fmtparams)
Return a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called
Since the decode() command returns a string, the for loop will iterate over the individual characters in the string, not the lines within it.
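You can see both the character-by-character behaviour and a fix in a small self-contained sketch. The sample string stands in for the decoded S3 contents; wrapping it in io.StringIO gives csv.reader the line iterator it expects:

```python
import csv
import io

# Sample data standing in for the decoded S3 object body
txt = 'Name\tGroup\nABL1\tTK\n'

# Iterating over a plain string yields individual characters, so
# csv.reader produces one-character rows
char_rows = list(csv.reader(txt, delimiter='\t'))
print(char_rows[:3])  # [['N'], ['a'], ['m']]

# io.StringIO wraps the string in a file-like object that iterates
# over lines, which is what csv.reader needs
rows = list(csv.reader(io.StringIO(txt), delimiter='\t'))
print(rows)  # [['Name', 'Group'], ['ABL1', 'TK']]
```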
Frankly, you could process the lines without using the CSV reader at all, simply by splitting on newlines and tabs, like this:
import boto3

s3 = boto3.resource('s3')

txt = s3.Bucket('bucketname').Object('object.txt').get()['Body'].read().decode('ascii')

lines = txt.split('\n')
for line in lines:
    fields = line.split('\t')
    print(fields)
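One small caveat with the split('\n') approach: if the file ends with a newline, the split produces a trailing empty string, which then becomes a spurious row. Python's str.splitlines() avoids that, as this sketch with sample data shows:

```python
# Sample data standing in for the decoded S3 object body
txt = 'Name\tGroup\nABL1\tTK\n'

# split('\n') keeps a trailing empty entry when the text ends in a newline
print(txt.split('\n'))   # ['Name\tGroup', 'ABL1\tTK', '']

# splitlines() drops the trailing empty entry
print(txt.splitlines())  # ['Name\tGroup', 'ABL1\tTK']

for line in txt.splitlines():
    print(line.split('\t'))
```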
All of the above issues should have been noticeable by adding some debugging to see whether each step was returning the data that you expected, such as printing the contents of the variables after each step.
Answered By - John Rotenstein