Issue
I currently use the following implementation to write a list of dictionaries to a file. file_limit_counter is initialized to 0 and file_limit to, say, 50. Whenever the counter reaches the file limit, the output switches to a new file, so the counter serves to split the output across multiple files:
self.file = self.generate_new_micro_file_name()
for response_dict in all_response_dict_list:
    if file_limit_counter == file_limit:
        self.propagate_log_msg('wrote {} records '.format(file_limit))
        self.file = self.generate_new_micro_file_name()
        file_limit_counter = 0
    with open(self.file, 'a', encoding="utf8") as open_out_file:
        json.dump(response_dict, open_out_file)
        open_out_file.write('\n')
    file_limit_counter += 1
The list all_response_dict_list contains something like this:
somelist = [{"Name":"a1","Age":"24"},{"Name":"a2","Age":"26"}]
and my intention is to have something like this in my output file:
{"Name":"a1","Age":"24"}
{"Name":"a2","Age":"26"}
...
The approach above works fine, but with a large set of dictionaries, for example 5000, it slows down noticeably (it takes approximately 10 minutes). It would be helpful to hear from anyone who has already come across this kind of scenario. I think it would take less time if the same thing could be done in parallel, i.e. writing multiple dictionaries into the same file at the same time rather than one by one.
Solution
A slight variation on this theme that runs in under 0.5 s on my machine. It opens each output file once per batch of file_limit records rather than once per record:
import json
import time

all_response_dict_list = [{'name': f'a{i}', 'age': i} for i in range(50_000)]
file_limit = 50

def fgen():
    fnum = 1
    while True:
        yield f'/Users/andy/j/base{fnum}.json'
        fnum += 1

def main():
    G = fgen()
    for offset in range(0, len(all_response_dict_list), file_limit):
        with open(next(G), 'a') as outfile:
            for rd in all_response_dict_list[offset:offset+file_limit]:
                json.dump(rd, outfile)
                outfile.write('\n')

if __name__ == '__main__':
    start = time.perf_counter()
    main()
    end = time.perf_counter()
    print(f'Duration={end-start:.4f}s')
Answered By - BrutusForcus
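The batching idea in the answer can also be sketched for inputs that are not lists (generators, for example), where slicing by offset is not available. This is only an illustrative variant, not part of the original answer: the helper names chunked and write_chunks, the out_dir parameter, and the base{n}.json naming are all assumptions made for the example. It also differs from the answer in one respect: each file is written once in 'w' mode, so an existing file of the same name is overwritten rather than appended to.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_chunks(records, out_dir, file_limit=50):
    """Write records as JSON Lines, at most `file_limit` records per file.

    Returns the list of paths written. The file naming is hypothetical.
    """
    out_dir = Path(out_dir)
    paths = []
    for file_number, chunk in enumerate(chunked(records, file_limit), start=1):
        path = out_dir / f"base{file_number}.json"  # assumed naming scheme
        # Serialize the whole chunk first, then write it in a single call.
        payload = "".join(json.dumps(record) + "\n" for record in chunk)
        path.write_text(payload, encoding="utf8")  # overwrites, does not append
        paths.append(path)
    return paths


if __name__ == "__main__":
    records = ({"Name": f"a{i}", "Age": str(i)} for i in range(120))
    with tempfile.TemporaryDirectory() as tmp:
        written = write_chunks(records, tmp, file_limit=50)
        print(len(written), "files written")
```

With 120 records and a limit of 50, this produces three files holding 50, 50 and 20 lines. Whether building the payload string before writing is measurably faster than calling json.dump per record would need to be timed on the actual data.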