Issue
I am using a loop (from the answer to this question) to iteratively open several CSV files, transpose them, and concatenate them into a large dataframe. Each CSV file is 15 MB with over 10,000 rows, and there are over 1,000 files. I am finding that the first 50 iterations finish within a few seconds, but after that each iteration takes a minute. I wouldn't mind keeping my computer on overnight, but I may need to do this multiple times and I'm worried that it will get exponentially slower. Is there a more memory-efficient way to do this, such as breaking up df into chunks of 50 rows each and then concatenating all of them at the end?
In the following code, df is a dataframe of 1000 rows that has columns to indicate folder and file name.
merged_data = pd.DataFrame()
count = 0
for index, row in df.iterrows():
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = os.path.join(root_path, folder_name, file_name)
    file_data = pd.read_csv(file_path, names=['Case', f'{folder_name}_{file_name}'], sep='\t')
    file_data_transposed = file_data.set_index('Case').T.reset_index(drop=True)
    file_data_transposed.insert(loc=0, column='folder_file_id', value=str(folder_name + '_' + file_name))
    merged_data = pd.concat([merged_data, file_data_transposed], axis=0, ignore_index=True)
    count = count + 1
    print(count)
Solution
The code is slow because pd.concat is called inside the loop, so each iteration copies all of the previously accumulated data again. Instead, collect the pieces in a Python dictionary and do a single concat at the end.
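A minimal sketch of that pattern with stand-in data (the actual file reading appears in the full solution below):

import pandas as pd

pieces = {}                                   # accumulate each piece here
for key in ['fileA', 'fileB']:                # stand-in for the real file loop
    pieces[key] = pd.DataFrame({'value': [1, 2]})

merged = pd.concat(pieces)                    # single concat at the end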
Applied to the question's loop, with a few additional improvements:
import pathlib
import pandas as pd

root_path = pathlib.Path('root')  # use pathlib instead of os.path

data = {}
# use enumerate rather than create an external counter
for count, (_, row) in enumerate(df.iterrows(), 1):
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'
    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    data[folder_file_id] = file_data.set_index('Case').squeeze()
    print(count)

merged_data = (pd.concat(data, names=['folder_file_id'])
                 .unstack('Case').reset_index())
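To show roughly what that last expression does, here is a small sketch on two toy Series (values mirror the sample output below): concat over the dict builds a Series with a two-level index (folder_file_id from the dict keys, Case from each stored Series), and unstack('Case') pivots the Case level into columns.

import pandas as pd

# two tiny stand-ins for the per-file Series stored in `data`
s1 = pd.Series([1234, 5678], index=pd.Index([0, 1], name='Case'))
s2 = pd.Series([4567, 8901], index=pd.Index([0, 1], name='Case'))

stacked = pd.concat({'folderA_file001.txt': s1,
                     'folderB_file002.txt': s2},
                    names=['folder_file_id'])   # index levels: folder_file_id, Case
wide = stacked.unstack('Case').reset_index()    # one row per file, one column per Case value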
Output:
>>> merged_data
Case       folder_file_id       0       1       2       3       4
0     folderA_file001.txt  1234.0  5678.0  9012.0  3456.0  7890.0
1     folderB_file002.txt  4567.0  8901.0  2345.0  6789.0     NaN
Input data:
>>> df
   File ID    File Name
0  folderA  file001.txt
1  folderB  file002.txt
>>> cat root/folderA/file001.txt
0 1234
1 5678
2 9012
3 3456
4 7890
>>> cat root/folderB/file002.txt
0 4567
1 8901
2 2345
3 6789
Multithreaded version:
from concurrent.futures import ThreadPoolExecutor
import pathlib
import pandas as pd

root_path = pathlib.Path('root')

def read_csv(args):
    count, row = args  # expand arguments
    folder_name = row['File ID'].strip()
    file_name = row['File Name'].strip()
    file_path = root_path / folder_name / file_name
    folder_file_id = f'{folder_name}_{file_name}'
    file_data = pd.read_csv(file_path, header=None, sep='\t',
                            names=['Case', folder_file_id],
                            memory_map=True, low_memory=False)
    print(count)
    return folder_file_id, file_data.set_index('Case').squeeze()

with ThreadPoolExecutor(max_workers=2) as executor:
    batch = enumerate(df[['File ID', 'File Name']].to_dict('records'), 1)
    data = executor.map(read_csv, batch)

merged_data = (pd.concat(dict(data), names=['folder_file_id'])
                 .unstack('Case').reset_index())
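A small aside, not part of the original answer: since the question mentions running this multiple times, the merged result can be written to disk once and reloaded later instead of re-reading the 1,000 files (the filename below is hypothetical):

# hypothetical filename; pickle keeps the integer column labels as-is
merged_data.to_pickle('merged_data.pkl')

# later runs can skip the whole loop
merged_data = pd.read_pickle('merged_data.pkl')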
Answered By - Corralien