Issue
I have a massive data set, ~600 GB spread across several CSV files. Each CSV file contains roughly 1.3 million rows by 17 columns of data. It looks like this:
index duration is_buy_order issued location_id min_volume order_id price range system_id type_id volume_remain volume_total region_id http_last_modified station_id constellation_id universe_id
0 90 True 2021-05-04T23:31:50Z 60014437 1 5980151223 5.05 region 30000001 18 249003 250000 10000001 2021-06-19T16:45:32Z 60014437.0 20000001 eve
1 90 True 2021-04-29T07:40:27Z 60012145 1 5884280397 5.01 region 30000082 18 13120 100000 10000001 2021-06-19T16:45:32Z 60012145.0 20000012 eve
2 90 False 2021-04-28T11:46:09Z 60013867 1 5986716666 12500.00 region 30000019 19 728 728 10000001 2021-06-19T16:45:32Z 60013867.0 20000003 eve
3 90 False 2021-05-22T14:13:15Z 60013867 1 6005466300 6000.00 region 30000019 19 5560 9191 10000001 2021-06-19T16:45:32Z 60013867.0 20000003 eve
4 90 False 2021-05-27T08:14:29Z 60013867 1 6008912593 5999.00 region 30000019 19 1 1 10000001 2021-06-19T16:45:32Z
I currently load each file into a dataframe, run it through some logic to filter it down to the rows with a particular "region_id" I'm looking for, and put those rows into an empty dataframe. Something like this:
import pathlib
import pandas as pd

path = pathlib.Path('somePath')
data = pd.read_csv(path)
# column position of 'region_id' so it can be looked up in each raw row
region_index = data.columns.get_loc('region_id')
# empty dataframe with the same columns, filled one row at a time
newData = pd.DataFrame(columns=data.columns)
for row in data.values:
    if row[region_index] == region.THE_FORGE.value:
        newData.loc[len(newData)] = row.tolist()
newData.to_csv(newCSVName, index=False)
This, however, takes ~74 minutes to get through a single file... and I need to do this over 600 GB worth of files.
So, as the title mentions, what is the fastest way I can/should do this, ideally in a way I can run iteratively over all the CSVs? I have thought about using async, but I'm not sure that is the best approach.
Solution
pandas offers optimized C-based functions that work on the entire table using native data types. When you iterate rows, look at individual values, and convert things to lists, pandas must convert its native data types to Python objects, and that can be slow. As you assign new rows, pandas must copy the table you've built so far, and that gets more and more expensive as the table grows.
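As a minimal illustration of that copy cost (not from the original answer; some_region_id is a placeholder and region_index is the column position from the question's code), compare growing the frame row by row with building it once from a list:

slow = pd.DataFrame(columns=data.columns)
for row in data.values:
    slow.loc[len(slow)] = row.tolist()   # reallocates and copies the frame on every insert

# collecting matching rows in a plain list and constructing the frame once avoids
# the repeated copies, though the vectorized filter below is still the better option
kept = [row for row in data.values if row[region_index] == some_region_id]
once = pd.DataFrame(kept, columns=data.columns)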
It looks like you could filter the dataframe by the single known region_id and write the CSV directly:
path = pathlib.Path('somePath')
data = pd.read_csv(path)
# boolean mask keeps only the matching rows; no Python-level loop or row-by-row copies
data[data['region_id'] == region.THE_FORGE.value].to_csv(newCSVName, index=False)
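Since the question also asks how to run this over all of the CSV files, here is a hedged sketch of one way to scale the same filter: loop over the files with pathlib.Path.glob and read each one in chunks via pandas' chunksize option, so the 600 GB data set never has to fit in memory at once. The directory names, file pattern, and chunk size are assumptions, not part of the original answer; region.THE_FORGE.value is the asker's own enum from the question.

import pathlib
import pandas as pd

src_dir = pathlib.Path('somePath')   # assumed input directory of CSV files
out_dir = pathlib.Path('filtered')   # assumed output directory
out_dir.mkdir(exist_ok=True)

for csv_path in src_dir.glob('*.csv'):
    out_path = out_dir / csv_path.name
    first_chunk = True
    # read in ~1,000,000-row chunks so memory use stays bounded
    for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
        filtered = chunk[chunk['region_id'] == region.THE_FORGE.value]
        # write the header only for the first chunk, then append
        filtered.to_csv(out_path, mode='w' if first_chunk else 'a',
                        header=first_chunk, index=False)
        first_chunk = False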
Answered By - tdelaney