Tuesday, November 28, 2023

[FIXED] Optimize Searching method in File at Python


Issue

I am trying to optimize the method below. It is the core of my project (close to 95% of the run time is spent in it). It reads the file line by line and, if the tid appears in a line, adds the first number on that line, which is the document id, to the result set. A few lines of the file, for example:

5168  268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161
5169  268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503

The current implementation calls tid_add_colon_in_front(tid), since the tid is just a string, and did_tids_file is the already-opened file that holds the data.

Any ideas as to how I can improve it any further will be welcome!

def dids_via_tid(tid) -> set:
    did_tids_file.seek(0)
    dids = set()

    # To find the term ids in the file
    tid = tid_add_colon_in_front(tid)
    did_str = ""

    # To not do line.split
    for line in did_tids_file:
        did_str = ""

        if tid in line:
            for char in line:
                if char == " ":
                    break
                did_str += char

            dids.add(did_str)
    return dids

My previous implementation used line.split, which returns a list and, as far as I know, is heavier in memory and time when dealing with very large amounts of data.
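For reference, the leading document id can be pulled out without a character loop and without splitting the whole line. The sketch below is only an illustration under the assumption that the id is always followed by a space, as in the sample lines; the helper name leading_did is hypothetical and not part of the original code.

```python
def leading_did(line: str) -> str:
    """Return the text before the first space (assumed to be the document id)."""
    # str.partition stops at the first separator and never builds a full list.
    did, _sep, _rest = line.partition(" ")
    return did


sample = "5168  268:0.0482384162801528 297:0.0437108092315354\n"
print(leading_did(sample))          # 5168
# split with maxsplit=1 also avoids splitting the rest of the line.
print(sample.split(maxsplit=1)[0])  # 5168
```

Whether either variant is actually faster than the loop on a given data set is something to measure, for example with timeit.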

Also, I have tried reading the file with readline as below, but it didn't improve the performance:

line = myFile.readline()
while line:
    # Do work
    line = myFile.readline()

Solution

If you are looking for multiple tids, you should definitely search for all of them in a single pass through the file. That will be much faster if the file has 100K or more lines.

# tid line finder

import re
from collections import defaultdict


def tid_searcher(filename, tids_of_interest):
    res = defaultdict(list)
    with open(filename, 'r') as src:
        for line in src:
            line_tids = set(re.findall(r'(\d+):', line)) # re:  group of one or more digits followed by colon
            hits = tids_of_interest & line_tids  # set intersection
            if hits:
                line_no = re.search(r'\A\d+', line).group(0) # re: one or more digits at start of string
                for hit in hits:
                    res[hit].append(line_no)

    return res

tids_of_interest = {'268', '271'}
filename = 'data.txt'

print(tid_searcher(filename, tids_of_interest))

# defaultdict(<class 'list'>, {'268': ['5168', '5169'], '271': ['5169']})
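The same single-pass idea can also be written without regular expressions. The following is a minimal sketch, assuming every line has the form "did tid:weight tid:weight ..." as in the sample data; the function name tid_searcher_no_regex and the in-memory sample list are illustrative assumptions, not part of the original answer.

```python
from collections import defaultdict


def tid_searcher_no_regex(lines, tids_of_interest):
    """Map each tid of interest to the document ids of the lines containing it.

    `lines` is any iterable of strings (an open file works);
    `tids_of_interest` must be a set of tid strings.
    """
    res = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        did = fields[0]
        # Each remaining field looks like "tid:weight"; keep only the tid part.
        line_tids = {field.partition(":")[0] for field in fields[1:]}
        for hit in tids_of_interest & line_tids:
            res[hit].append(did)
    return res


sample = [
    "5168  268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161\n",
    "5169  268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503\n",
]
result = tid_searcher_no_regex(sample, {"268", "271"})
print(result["268"])  # ['5168', '5169']
print(result["271"])  # ['5169']
```

Which variant runs faster depends on the data, so it is worth timing both on a real file before choosing one.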

Answered By - AirSquid

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
