Issue
I am trying to optimize the method below. It is the core of my project (close to 95% of the run time is spent in it). It reads the file line by line and, if the tid is in a line, collects the first number on that line, which is the document id. A few lines of the file, for example:
5168 268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161
5169 268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503
The current implementation uses the method tid_add_colon_in_front(tid), since the tid is just a string; did_tids_file is the file that holds the data (it has already been opened).
Any ideas on how I can improve it further are welcome!
    def dids_via_tid(tid) -> set:
        did_tids_file.seek(0)
        dids = set()
        # To find the term ids in the file
        tid = tid_add_colon_in_front(tid)
        did_str = ""
        # To not do line.split
        for line in did_tids_file:
            did_str = ""
            if tid in line:
                for char in line:
                    if char == " ":
                        break
                    did_str += char
                dids.add(did_str)
        return dids
My previous implementation used line.split, which returns a list and, as far as I know, is heavier in memory and time when dealing with very large amounts of data.
I have also tried reading the file with readline as below, but it didn't improve the performance:
    line = myFile.readline()
    while line:
        # Do work
        line = myFile.readline()
Solution
If you are looking for multiple tids, you should definitely search for all of them in a single pass through the file rather than one pass per tid. It will be much faster if your file has 100K or more lines.
    # tid line finder
    import re
    from collections import defaultdict

    def tid_searcher(filename, tids_of_interest):
        res = defaultdict(list)
        with open(filename, 'r') as src:
            for line in src:
                line_tids = set(re.findall(r'(\d+):', line))  # re: group of one or more digits followed by colon
                hits = tids_of_interest & line_tids  # set intersection
                if hits:
                    line_no = re.search(r'\A\d+', line).group(0)  # re: one or more digits at start of string
                    for hit in hits:
                        res[hit].append(line_no)
        return res

    tids_of_interest = {'268', '271'}
    filename = 'data.txt'
    print(tid_searcher(filename, tids_of_interest))
    # defaultdict(<class 'list'>, {'268': ['5168', '5169'], '271': ['5169']})
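If you really only need one tid at a time, a regex-free variant may be worth timing as well. The sketch below is only an illustration under a couple of assumptions: fields are separated by a single space, and every tid is followed by a colon, as in the sample lines. The function name dids_for_tid and the sample list are made up for this example; in practice you would pass the open file object instead of the list. It uses str.partition to grab the document id without splitting the whole line or looping over characters in Python, and it searches for " tid:" (with the leading space) so that looking for 68 does not accidentally match 268.

```python
def dids_for_tid(lines, tid):
    """Return the set of document ids whose line contains term id `tid`.

    `lines` is any iterable of text lines (an open file works).
    Assumes single-space separators and "tid:weight" fields.
    """
    # Leading space + trailing colon makes this a whole-token match,
    # so "68" does not match inside "268:".
    needle = f" {tid}:"
    dids = set()
    for line in lines:
        if needle in line:
            # partition stops at the first space: no full split, no char loop
            did, _, _ = line.partition(" ")
            dids.add(did)
    return dids


# Hypothetical in-memory sample, copied from the lines in the question
sample = [
    "5168 268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161\n",
    "5169 268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503\n",
]
print(dids_for_tid(sample, "271"))  # {'5169'}
print(dids_for_tid(sample, "68"))   # set()
```

Built-in string methods such as partition and the in operator run in C in CPython, so they are usually faster than building did_str one character at a time, but measure on your own data before committing to either version.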
Answered By - AirSquid