Issue
I am trying to use pytesseract to read a few blocks of text, but it doesn't recognize symbols when they appear in front of or between words. It does, however, recognize the same symbols when they appear in front of numbers.
Example:
'#test $test %test'
on the image is read incorrectly as 'Htest Stest Stest'
'#500 $500 %500'
on the image is read correctly as '#500 $500 %500'
Here is my code:
import cv2
import pytesseract
from PIL import Image
# Load the image and convert it to grayscale
image = cv2.imread("test.png")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarize: pixels brighter than the threshold become white, the rest black
threshold = 225
_, img_binarized = cv2.threshold(image, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)
# Point pytesseract at the local Tesseract install, then run OCR
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
msg = pytesseract.image_to_string(pil_img)
print(msg)
I have played around with a bunch of different config settings in the image_to_string call but haven't found anything that works. Any help is appreciated.
Solution
I ended up downloading all the .traineddata files from https://tesseract-ocr.github.io/tessdoc/Data-Files.html to my Tesseract-OCR folder and looping through all of them using the lang parameter of image_to_string. For some reason, a few languages that share the same alphabet as English read the symbols correctly (Italian and Croatian worked best).
My code is the same as above, except that the language is set explicitly:
msg = pytesseract.image_to_string(pil_img, lang='ita')
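The loop over languages mentioned above could look roughly like the sketch below. This is not the exact code that was used, only one way to write it. The helper names (ocr_with_each_language, matching_languages, try_installed_languages) are made up for illustration, and the last helper assumes a pytesseract version that provides get_languages (0.3.7 or later, as far as I know).

```python
def ocr_with_each_language(image, languages, ocr):
    """Run `ocr(image, lang=...)` once per language code.

    `ocr` is any callable with that shape, e.g. pytesseract.image_to_string.
    Returns {lang: recognized text}, or None for a language that failed.
    """
    results = {}
    for lang in languages:
        try:
            results[lang] = ocr(image, lang=lang).strip()
        except Exception:
            # e.g. the .traineddata file for this language is missing
            results[lang] = None
    return results


def matching_languages(results, expected):
    """Return the language codes whose output equals the expected text."""
    return [lang for lang, text in results.items() if text == expected]


def try_installed_languages(pil_img, expected):
    """Hypothetical convenience wrapper around pytesseract (assumed installed)."""
    import pytesseract  # imported here so the helpers above work without it

    # 'osd' is orientation/script detection data, not a text language
    languages = [l for l in pytesseract.get_languages(config="") if l != "osd"]
    results = ocr_with_each_language(pil_img, languages, pytesseract.image_to_string)
    return matching_languages(results, expected)
```

With the pil_img from the question, calling try_installed_languages(pil_img, '#test $test %test') would list the language codes that read the symbols correctly on that image.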
Answered By - mattwatkins