Issue
I have a huge compressed file and I want to read the CSV files inside it as individual dataframes, so that I don't run out of memory.
Also, because of time and disk-space constraints, I can't extract the .tar.gz.
This is the code I have so far:
import pandas as pd
# tarfile lets us navigate a compressed archive
# without extracting its contents
import tarfile
import io

tar_file = tarfile.open(r'\\path\to\the\tar\file.tar.gz')

# With the following code we can iterate over the CSV files contained in the archive
def generate_individual_df(tar_file):
    return (
        (
            member.name,
            pd.read_csv(
                io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
                header=None,
            ),
        )
        for member in tar_file
        if member.isreg()  # regular files only; skip directories and links
    )

for filename, dataframe in generate_individual_df(tar_file):
    # But dataframe is the whole file, which is too big
    pass
I tried the answers to "How to create Panda Dataframe from csv that is compressed in tar.gz?" but still can't solve ...
Solution
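You can iterate over chunks of each CSV inside the compressed file with the following function. Passing chunksize to pd.read_csv makes it return an iterator of dataframes with at most that many rows each, instead of one dataframe for the whole file: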
def generate_individual_df(tar_file, chunksize=10**4):
    return (
        (member.name, chunk)
        for member in tar_file
        if member.isreg()
        for chunk in pd.read_csv(
            io.StringIO(tar_file.extractfile(member).read().decode('ascii')),
            header=None,
            chunksize=chunksize,
        )
    )
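Note that the function above still calls .read() on each member, so the full decoded text of one CSV is held in memory before it is split into chunks. If that is still too much, one possible variation is to hand the file object returned by extractfile straight to pd.read_csv, which accepts file-like objects and then reads from it as each chunk is requested. The sketch below is only an illustration under that assumption: the name iter_csv_chunks and its encoding parameter are made up for this example and are not part of the original answer.

```python
import tarfile

import pandas as pd


def iter_csv_chunks(tar_path, chunksize=10**4, encoding="ascii"):
    """Yield (member name, dataframe chunk) pairs without extracting to disk.

    Hypothetical helper for illustration; not part of the original answer.
    """
    with tarfile.open(tar_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isreg():
                continue  # skip directories, links and other non-regular entries
            # extractfile returns a binary file-like object. Passing it to
            # read_csv avoids building one big decoded string first.
            with tar.extractfile(member) as fileobj:
                reader = pd.read_csv(
                    fileobj,
                    header=None,
                    chunksize=chunksize,
                    encoding=encoding,
                )
                for chunk in reader:
                    yield member.name, chunk
```

Because this is a generator that opens the archive itself, the archive stays open only while you are iterating, and each chunk can be processed and discarded before the next one is read.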
Answered By - Alexander Martins