Issue
I'm using Tesseract with Python. I have an image containing 1-6 words and need to read the text. Sometimes the character "C", which looks the same in upper and lower case, is detected as lowercase "c" instead of uppercase "C". I understand why this is ambiguous, but given the letters that follow it should be possible to infer the correct case. Is there a configuration option or some other way to improve this? I'm using this code:
import argparse

import cv2
import pytesseract

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
                help="path to input image to be OCR'd")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
if image is None:
    raise SystemExit("Could not read image: " + args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# pytesseract accepts the NumPy array directly, so no temporary
# file has to be written to disk (or cleaned up afterwards)
text = pytesseract.image_to_string(gray)
print("Output: " + text)
I looked at the page segmentation option, config='-psm x', with different values for x, but none of them solved my problem.
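For reference, here is a minimal sketch of how a page segmentation mode can be passed through pytesseract's `config` argument. The helper names `psm_config` and `ocr_with_psm` are made up for illustration. The flag spelling is an assumption worth checking against your own install: Tesseract 3 used `-psm`, while Tesseract 4 and later expect `--psm`.

```python
def psm_config(mode, major_version):
    """Build the page-segmentation config string (hypothetical helper).

    Assumption: Tesseract 3 spells the flag "-psm", Tesseract 4+ "--psm".
    """
    flag = "-psm" if major_version < 4 else "--psm"
    return "{} {}".format(flag, mode)


def ocr_with_psm(image, mode, major_version):
    """Run OCR on an image (NumPy array or PIL image) with the given mode."""
    # imported here so psm_config can be used without pytesseract installed
    import pytesseract

    return pytesseract.image_to_string(
        image, config=psm_config(mode, major_version)
    )
```

For a short image like the one described, `ocr_with_psm(gray, 7, 5)` would ask Tesseract 5 to treat the image as a single text line. Whether that changes the "C"/"c" result is not something this sketch can promise.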
Solution
Updating from Tesseract 3 to Tesseract 5 fixed the problem.
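To see which Tesseract version pytesseract is actually calling, something like the sketch below may help. `pytesseract.get_tesseract_version()` is part of pytesseract; the helpers `major_version`, `needs_upgrade`, and `installed_needs_upgrade` are hypothetical names introduced here, and the default threshold of 5 simply mirrors the version mentioned above.

```python
import re


def major_version(version_text):
    """Extract the leading major number from text like "3.04.01" or "5.3.0"."""
    match = re.search(r"(\d+)\.", str(version_text))
    if match is None:
        raise ValueError("no version number found in: {!r}".format(version_text))
    return int(match.group(1))


def needs_upgrade(version_text, minimum_major=5):
    """Return True if the reported version is older than minimum_major."""
    return major_version(version_text) < minimum_major


def installed_needs_upgrade(minimum_major=5):
    """Check the Tesseract binary that pytesseract would invoke."""
    # imported here so the parsing helpers work without pytesseract installed
    import pytesseract

    return needs_upgrade(pytesseract.get_tesseract_version(), minimum_major)
```

If `installed_needs_upgrade()` returns True, the script is still running against an older Tesseract, even if a newer one is installed elsewhere on the system.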
Answered By - csi