Issue
I have an archive file (archive.tar.gz) which contains multiple gzip-compressed text files (such as file.txt.gz).
If I first extract the .txt.gz files to a folder, I can then open them with pandas directly using:
import pandas as pd
df = pd.read_csv('file.txt.gz', sep='\t', encoding='utf-8')
But if I read one of those files straight out of the archive using the tarfile library, it fails:
import pandas as pd
import tarfile
tar = tarfile.open("archive.tar.gz", "r:*")
csv_path = tar.getnames()[1]
df = pd.read_csv(tar.extractfile(csv_path), sep='\t', encoding='utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Is it possible to read these files with pandas without extracting them first?
Solution
When you open the file by filename, pandas can infer that it is compressed with gzip from the .gz extension on the filename.
When you pass pandas a file object instead, there is no filename extension to inspect, so you need to specify the compression explicitly so that pandas can decompress the data as it reads the file.
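The byte in the error message is a useful clue: gzip data starts with the two bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, so the decoder fails at position 1. Here is a minimal standard-library sketch of that, assuming a small made-up archive built in memory (the member name file.txt.gz and its contents are placeholders, not the real data):

```python
import csv
import gzip
import io
import tarfile

# Hypothetical sample data: one gzip-compressed TSV stored inside a tar.gz.
tsv_bytes = "id\tname\n1\talpha\n2\tbeta\n".encode("utf-8")
inner_gz = gzip.compress(tsv_bytes)

outer = io.BytesIO()
with tarfile.open(fileobj=outer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="file.txt.gz")
    info.size = len(inner_gz)
    tar.addfile(info, io.BytesIO(inner_gz))
outer.seek(0)

with tarfile.open(fileobj=outer, mode="r:*") as tar:
    member = tar.extractfile("file.txt.gz")

    # The member is still gzip-compressed: it starts with the gzip magic bytes.
    magic = member.read(2)
    member.seek(0)

    # Decompress explicitly while reading, then parse the tab-separated text.
    with gzip.open(member, mode="rt", encoding="utf-8", newline="") as text:
        rows = list(csv.reader(text, delimiter="\t"))

print(magic)  # the two gzip magic bytes
print(rows)
```

Opening the tar archive only removes the outer layer of compression; each member needs its own gzip pass, which is what the compression argument asks pandas to do.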
This should work:
df = pd.read_csv(
    tar.extractfile(csv_path),
    compression='gzip',
    sep='\t',
    encoding='utf-8',
)
For more details, see the entry about the "compression" argument in the documentation for read_csv().
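Since the archive in the question holds several .txt.gz files, the same call can be applied to every matching member. The following is only a sketch under assumptions: the sample archive is built in memory so the snippet is self-contained, and the helper name read_gz_members and the member names a.txt.gz and b.txt.gz are made up for illustration.

```python
import gzip
import io
import tarfile

import pandas as pd


def read_gz_members(tar, suffix=".txt.gz", **read_csv_kwargs):
    """Hypothetical helper: read every gzip-compressed member into a DataFrame."""
    frames = {}
    for member in tar.getmembers():
        # Skip directories and members that do not look like gzip'd text files.
        if not member.isfile() or not member.name.endswith(suffix):
            continue
        fileobj = tar.extractfile(member)
        # A file object carries no extension, so name the compression explicitly.
        frames[member.name] = pd.read_csv(
            fileobj, compression="gzip", **read_csv_kwargs
        )
    return frames


# Build a small stand-in for archive.tar.gz (made-up contents).
samples = {
    "a.txt.gz": "id\tname\n1\talpha\n2\tbeta\n",
    "b.txt.gz": "id\tname\n3\tgamma\n",
}
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name, text in samples.items():
        payload = gzip.compress(text.encode("utf-8"))
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

with tarfile.open(fileobj=buffer, mode="r:*") as tar:
    frames = read_gz_members(tar, sep="\t", encoding="utf-8")

for name, df in frames.items():
    print(name, df.shape)
```

With a real file on disk, tarfile.open("archive.tar.gz", "r:*") would replace the in-memory buffer. Filtering on the member name is also safer than indexing tar.getnames() by position, since the order of entries and the presence of directory entries depend on how the archive was created.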
Answered By - filbranden