Issue
I have 3 huge data frames, each around 40 GB in size, which I opened using chunks. I want to concatenate them together. Here is what I tried:
import os
import pandas as pd

path = 'path/to/myfiles'
files = [os.path.join(path, i) for i in os.listdir(path) if i.endswith('tsv')]
cols = ['col1', 'col2', 'col3']

for file in files:
    # each file is read lazily in chunks of 10 million rows
    chunks = pd.read_table(file, sep='\t', names=cols, chunksize=10000000)
However, when I try to concatenate all the files, it takes forever. I would appreciate suggestions for concatenating the data frames faster.
Solution
.tsv and .csv are fairly slow formats to read/write. I've found parquet works best for most of the stuff I end up doing. It's quite fast on reads and writes, and it also lets you read a chunked folder of files back in as a single table easily. It does require string column names, however:
In [102]: df = pd.DataFrame(np.random.random((100000, 100)), columns=[str(i) for i in range(100)])
In [103]: %time df.to_parquet("out.parquet")
Wall time: 980 ms
In [104]: %time df.to_csv("out.csv")
Wall time: 14 s
In [105]: %time df = pd.read_parquet("out.parquet")
Wall time: 195 ms
In [106]: %time df = pd.read_csv("out.csv")
Wall time: 1.53 s
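To illustrate the "chunked folder" point above, here is a minimal sketch (assuming the pyarrow engine is installed; the folder and file names are just placeholders): write the frame out in pieces, one parquet file per piece, then read the whole folder back with a single call.

import os
import numpy as np
import pandas as pd

os.makedirs("out_chunks", exist_ok=True)  # placeholder folder name
df = pd.DataFrame(np.random.random((100000, 100)), columns=[str(i) for i in range(100)])

# Write the frame in pieces, one parquet file per 10,000-row slice.
step = 10000
for i, start in enumerate(range(0, len(df), step)):
    df.iloc[start:start + step].to_parquet(f"out_chunks/part_{i:03d}.parquet")

# Read the whole folder back as a single DataFrame (pyarrow treats the
# directory as one dataset).
df_back = pd.read_parquet("out_chunks")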
If you don't have control over the format those chunked files are in, you'll obviously need to pay the read cost of them at least once, but converting them could still save you some time in the long run if you do a lot of other read/writes.
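For the files in the question, the one-off conversion could look roughly like this. This is a sketch, not a drop-in answer: the output directory name is made up, and pyarrow or fastparquet must be installed for to_parquet / read_parquet to work.

import os
import pandas as pd

path = 'path/to/myfiles'
out_dir = 'path/to/parquet'   # hypothetical output location
os.makedirs(out_dir, exist_ok=True)

cols = ['col1', 'col2', 'col3']
files = [os.path.join(path, f) for f in os.listdir(path) if f.endswith('tsv')]

# Stream each huge .tsv once, writing every chunk out as its own parquet file.
for file in files:
    base = os.path.splitext(os.path.basename(file))[0]
    for i, chunk in enumerate(pd.read_table(file, names=cols, chunksize=10000000)):
        chunk.to_parquet(os.path.join(out_dir, f"{base}_{i:04d}.parquet"))

# From now on the whole dataset reads back as one frame in a single, fast call.
df = pd.read_parquet(out_dir)

Whether that final read fits in memory depends on your machine, but parquet also lets you pull back only the columns you need via read_parquet's columns= argument, which .tsv can't do.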
Answered By - Randy