Issue
I am trying to use PyTesseract to recognize text in scanned documents that contain plain text as well as complicated diagrams. PyTesseract misinterprets the diagrams (or parts of them) as characters, which I do not want. One solution would be to limit the maximum size (width or height) of the characters to be recognized and ignore anything larger (i.e., the diagrams).
Is there any way I can limit the maximum size of characters to be recognized by PyTesseract?
I am using PyTesseract, which reports version LooseVersion('5.0.0-alpha.20200328'), in a Jupyter Notebook on Python 3.8.5.
Solution
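I found a simple solution to this problem, so I am posting it here. I used pytesseract.image_to_boxes, which returns one line per recognized character containing the character, its bounding-box coordinates (left, bottom, right, top), and the page number.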
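import re
import pytesseract

# img is the scanned page, already loaded (e.g. as a PIL image)
data = pytesseract.image_to_boxes(img)
# Splitting on spaces yields groups of five tokens: four coordinates
# followed by "<page>\n<next letter>"
boxes = re.split(' ', data)
line = list()
coords = list()
letters = [boxes[0]]
for i in range(1, len(boxes)):
    if (i%5 != 0):
        line.append(int(boxes[i]))
    else:
        # Drop the single-digit page number and newline, keep the next letter
        letters.append(boxes[i][2:])
        coords.append(line)
        line = []
for i in range(0, len(coords)):
    print(letters[i], coords[i])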
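The code above separates the letters and their coordinates into two parallel lists. I then applied this condition to each entry
    if (abs(coords[i][2] - coords[i][0]) < size) and (abs(coords[i][3] - coords[i][1]) < size+5):
to keep only the characters whose bounding box is narrower than size pixels and shorter than size+5 pixels, where size is the maximum character size you choose.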
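The parsing and the size check can also be combined into one small function. The following is only a sketch of that idea, not the exact code from the answer: the name filter_boxes_by_size, its parameters max_width and max_height, and the sample string are made up for illustration. It assumes the input is text in the format produced by pytesseract.image_to_boxes (one "<char> <left> <bottom> <right> <top> <page>" line per character) and parses it line by line instead of splitting on spaces.

```python
def filter_boxes_by_size(box_text, max_width, max_height):
    """Keep only characters whose bounding box is smaller than the limits.

    box_text is assumed to be in the format returned by
    pytesseract.image_to_boxes: one line per character,
    "<char> <left> <bottom> <right> <top> <page>".
    """
    kept = []
    for row in box_text.splitlines():
        parts = row.split(" ")
        if len(parts) != 6:
            continue  # skip blank or malformed rows
        char = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        width = abs(right - left)
        height = abs(top - bottom)
        # Same test as the condition in the answer: strict upper bounds
        if width < max_width and height < max_height:
            kept.append((char, (left, bottom, right, top)))
    return kept


# Hypothetical sample in box format: two normal letters and one
# oversized "character" standing in for a misread diagram
sample = "H 10 20 22 40 0\ni 25 20 30 40 0\nO 50 10 400 380 0\n"
size = 30  # hypothetical threshold in pixels
small = filter_boxes_by_size(sample, size, size + 5)
print("".join(char for char, _ in small))  # prints "Hi"
```

With a real scan, the sample string would be replaced by the output of pytesseract.image_to_boxes(img). A suitable value for size depends on the scan resolution and font size, so it likely needs to be tuned per document.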
Answered By - Dev Bhuyan