Issue
I am using a loop (from the answer to this question) to iteratively open several CSV files, transpose them, and concatenate them into a large dataframe. Each CSV file is 15 MB with over 10,000 rows, and there are over 1,000 files. I am finding that the first 50 iterations finish within a few seconds, but after that each iteration takes a minute. I wouldn't mind keeping my computer on overnight, but I may need to do this multiple times and I'm worried that it will get exponentially slower. Is there a more memory-efficient way to do this, such as breaking up df into chunks of 50 rows each and then concatenating all of them at the end?
In the following code, df is a dataframe of 1000 rows that has columns to indicate folder and file name.
merged_data = pd.DataFrame()
count = 0
for index, row in df.iterrows():
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)
    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=str(folder_name + '_' + file_name))
    merged_data = pd.concat([merged_data, file_data_transposed], axis=0, ignore_index=True)
    count = count + 1
    print(count)
Solution
The code is slow because pd.concat is called inside the loop, so each iteration copies all of the previously accumulated data again. Instead, collect the pieces in a Python dictionary and do a single concat at the end.
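A minimal sketch of that pattern with stand-in data (the actual file reading appears in the full solution below):

import pandas as pd

pieces = {}                                   # accumulate each piece here
for key in ['fileA', 'fileB']:                # stand-in for the real file loop
    pieces[key] = pd.DataFrame({'value': [1, 2]})

merged = pd.concat(pieces)                    # single concat at the end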
Applied to the question's loop, with a few additional improvements:
import pathlib
import pandas as pd

root_path = pathlib.Path('root')  # use pathlib instead of os.path

data = {}
# use enumerate rather than create an external counter
for count, (_, row) in enumerate(df.iterrows(), 1):
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'
    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    data[folder_file_id] = file_data.set_index('Case').squeeze()
    print(count)

merged_data = (pd.concat(data, names=['folder_file_id'])
                 .unstack('Case').reset_index())
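To show roughly what that last expression does, here is a small sketch on two toy Series (values mirror the sample output below): concat over the dict builds a Series with a two-level index (folder_file_id from the dict keys, Case from each stored Series), and unstack('Case') pivots the Case level into columns.

import pandas as pd

# two tiny stand-ins for the per-file Series stored in `data`
s1 = pd.Series([1234, 5678], index=pd.Index([0, 1], name='Case'))
s2 = pd.Series([4567, 8901], index=pd.Index([0, 1], name='Case'))

stacked = pd.concat({'folderA_file001.txt': s1,
                     'folderB_file002.txt': s2},
                    names=['folder_file_id'])   # index levels: folder_file_id, Case
wide = stacked.unstack('Case').reset_index()    # one row per file, one column per Case value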
Output:
>>> merged_data
Case       folder_file_id       0       1       2       3       4
0     folderA_file001.txt  1234.0  5678.0  9012.0  3456.0  7890.0
1     folderB_file002.txt  4567.0  8901.0  2345.0  6789.0     NaN
Input data:
>>> df
   File ID    File Name
0  folderA  file001.txt
1  folderB  file002.txt
>>> cat root/folderA/file001.txt
0 1234
1 5678
2 9012
3 3456
4 7890
>>> cat root/folderB/file002.txt
0 4567
1 8901
2 2345
3 6789
Multithreaded version:
from concurrent.futures import ThreadPoolExecutor
import pathlib
import pandas as pd

root_path = pathlib.Path('root')

def read_csv(args):
    count, row = args  # expand arguments
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'
    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    print(count)
    return folder_file_id, file_data.set_index('Case').squeeze()

with ThreadPoolExecutor(max_workers=2) as executor:
    batch = enumerate(df[['File ID', 'File Name']].to_dict('records'), 1)
    data = executor.map(read_csv, batch)

merged_data = (pd.concat(dict(data), names=['folder_file_id'])
                 .unstack('Case').reset_index())
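A small aside, not part of the original answer: since the question mentions running this multiple times, the merged result can be written to disk once and reloaded later instead of re-reading the 1,000 files (the filename below is hypothetical):

# hypothetical filename; pickle keeps the integer column labels as-is
merged_data.to_pickle('merged_data.pkl')

# later runs can skip the whole loop
merged_data = pd.read_pickle('merged_data.pkl')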
Answered By - Corralien