Issue
I need to read thousands of csv files and output them as a single csv file in Python.
Each of the original files will be used to create a single row in the final output, with each column being some aggregation over the rows of that file.
Due to the combined size of the files, this takes many hours to process, and the data cannot all be loaded into memory at once.
I am able to read in each csv and delete it from memory to work around the RAM issue. However, I am currently reading and processing each csv iteratively (in Pandas) and appending the output row to the final csv, which seems slow. I believe I could use the multiprocessing library to have each process read and process its own csv, but wasn't sure if there was a better way than this.
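To make the idea concrete, here is a minimal sketch of the multiprocessing version I had in mind (aggregate_file and the data/ input directory are placeholders; the aggregation matches the example below):
import glob
import multiprocessing as mp
import pandas as pd

def aggregate_file(path):
    # Reduce one input csv to a single output row
    df = pd.read_csv(path)
    return {'id': df['id'].iloc[0],
            'col1_avg': df['col1'].mean(),
            'col2_max': df['col2'].max()}

if __name__ == '__main__':
    # 'data/*.csv' is a placeholder for wherever the input files live
    paths = glob.glob('data/*.csv')
    with mp.Pool() as pool:
        rows = pool.map(aggregate_file, paths)
    pd.DataFrame(rows).to_csv('output.csv', index=False)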
What is the fastest way to complete this in Python while having RAM limitations?
As an example, ABC.csv and DEF.csv would each be read and processed into an individual row in the final output csv. (The actual files would have tens of columns and hundreds of thousands of rows.)
ABC.csv:
id,col1,col2
abc,2.3,3
abc,3.7,5
abc,3.0,9
DEF.csv:
id,col1,col2
def,1.9,3
def,2.8,2
def,1.6,1
Final Output:
id,col1_avg,col2_max
abc,3.0,9
def,2.1,3
Solution
I would suggest using dask for this. It's a library that lets you process datasets in parallel and out of core, so the full data never has to fit in memory.
import dask.dataframe as dd

# Lazily read every csv in the current directory into one dataframe
df = dd.read_csv('*.csv')
# One output row per id: mean of col1, max of col2
agg = df.groupby('id').agg({'col1': 'mean', 'col2': 'max'})
agg = agg.rename(columns={'col1': 'col1_avg', 'col2': 'col2_max'})
# single_file=True writes one csv rather than one file per partition
agg.to_csv('output.csv', single_file=True)
Code explanation
dd.read_csv will read all the csv files in the current directory and concatenate them into a single lazy dataframe, without loading everything into memory at once.
df.groupby('id').agg({'col1': 'mean', 'col2': 'max'}) will group the dataframe by the id column and then calculate the mean of col1 and the max of col2 for each group; the rename call gives those columns the header names from the desired output.
agg.to_csv('output.csv', single_file=True) will write the result to a single csv file (by default, dask writes one file per partition). Nothing is actually computed until this step: dask builds a lazy task graph and only executes it when the output is requested.
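Since your constraint is RAM, it's also worth noting that dd.read_csv accepts a blocksize argument that controls how much of each file goes into a single partition; smaller partitions lower the peak memory footprint at some cost in speed. A minimal example (the '16MB' value is just illustrative, tune it for your machine):
# Smaller partitions keep peak memory lower; '16MB' is an illustrative value
df = dd.read_csv('*.csv', blocksize='16MB')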
Performance
I tested this on my machine with a directory containing 10,000 csv files with 10,000 rows each. The code took about 2 minutes to run.
Installation
To install dask, run pip install dask (or pip install "dask[dataframe]" to also pull in the dependencies needed for dask dataframes).
Answered By - Fastnlight