Issue
I have a massive data set, ~600 GB spread across several CSV files. Each CSV file contains roughly 1.3 million rows by 17 columns of data. It looks like this:
index duration is_buy_order issued location_id min_volume order_id price range system_id type_id volume_remain volume_total region_id http_last_modified station_id constellation_id universe_id
0 90 True 2021-05-04T23:31:50Z 60014437 1 5980151223 5.05 region 30000001 18 249003 250000 10000001 2021-06-19T16:45:32Z 60014437.0 20000001 eve
1 90 True 2021-04-29T07:40:27Z 60012145 1 5884280397 5.01 region 30000082 18 13120 100000 10000001 2021-06-19T16:45:32Z 60012145.0 20000012 eve
2 90 False 2021-04-28T11:46:09Z 60013867 1 5986716666 12500.00 region 30000019 19 728 728 10000001 2021-06-19T16:45:32Z 60013867.0 20000003 eve
3 90 False 2021-05-22T14:13:15Z 60013867 1 6005466300 6000.00 region 30000019 19 5560 9191 10000001 2021-06-19T16:45:32Z 60013867.0 20000003 eve
4 90 False 2021-05-27T08:14:29Z 60013867 1 6008912593 5999.00 region 30000019 19 1 1 10000001 2021-06-19T16:45:32Z
I currently load each file into a dataframe, run it through some logic to filter it down to the rows with a particular "region_id" I'm looking for, and put those rows into an empty dataframe. Something like this:
import pathlib
import pandas as pd

path = pathlib.Path('somePath')
data = pd.read_csv(path)
# column position of 'region_id' so it can be looked up in each raw row
region_index = data.columns.get_loc('region_id')
# empty dataframe with the same columns, filled one row at a time
newData = pd.DataFrame(columns=data.columns)
for row in data.values:
    if row[region_index] == region.THE_FORGE.value:
        newData.loc[len(newData)] = row.tolist()
newData.to_csv(newCSVName, index=False)
This, however, takes ~74 minutes to get through a single file... and I need to do this over 600 GB worth of files.
So, as the title mentions, what is the fastest way I can/should do this, ideally in a way I can run iteratively over all the CSVs? I have thought about using async, but I'm not sure that is the best approach.
Solution
pandas offers optimized C-based functions that work on the entire table using native data types. When you iterate rows, look at individual values, and convert things to lists, pandas must convert its native data types to Python objects, and that can be slow. As you assign new rows, pandas must copy the table you've built so far, and that gets more and more expensive as the table grows.
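As a minimal illustration of that copy cost (not from the original answer; some_region_id is a placeholder and region_index is the column position from the question's code), compare growing the frame row by row with building it once from a list:

slow = pd.DataFrame(columns=data.columns)
for row in data.values:
    slow.loc[len(slow)] = row.tolist()   # reallocates and copies the frame on every insert

# collecting matching rows in a plain list and constructing the frame once avoids
# the repeated copies, though the vectorized filter below is still the better option
kept = [row for row in data.values if row[region_index] == some_region_id]
once = pd.DataFrame(kept, columns=data.columns)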
It looks like you could filter the dataframe by the single known region_id and write the CSV directly:
path = pathlib.Path('somePath')
data = pd.read_csv(path)
# boolean mask keeps only the matching rows; no Python-level loop or row-by-row copies
data[data['region_id'] == region.THE_FORGE.value].to_csv(newCSVName, index=False)
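Since the question also asks how to run this over all of the CSV files, here is a hedged sketch of one way to scale the same filter: loop over the files with pathlib.Path.glob and read each one in chunks via pandas' chunksize option, so the 600 GB data set never has to fit in memory at once. The directory names, file pattern, and chunk size are assumptions, not part of the original answer; region.THE_FORGE.value is the asker's own enum from the question.

import pathlib
import pandas as pd

src_dir = pathlib.Path('somePath')   # assumed input directory of CSV files
out_dir = pathlib.Path('filtered')   # assumed output directory
out_dir.mkdir(exist_ok=True)

for csv_path in src_dir.glob('*.csv'):
    out_path = out_dir / csv_path.name
    first_chunk = True
    # read in ~1,000,000-row chunks so memory use stays bounded
    for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
        filtered = chunk[chunk['region_id'] == region.THE_FORGE.value]
        # write the header only for the first chunk, then append
        filtered.to_csv(out_path, mode='w' if first_chunk else 'a',
                        header=first_chunk, index=False)
        first_chunk = False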
Answered By - tdelaney