Issue
I want to OCR images like the following, where a single digit sits inside a white area surrounded by black:
For this I am using pytesseract.__version__= 0.3.10
and
tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.38 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
Trying every combination of Tesseract's oem and psm settings does not recognize the 4 in the image. See the script below, which never raises the desired ValueError().
How can I get the digit recognized?
import pytesseract
import matplotlib
matplotlib.rcParams['figure.figsize'] = (20.0, 22.0)
from matplotlib import pyplot as plt
import cv2
import numpy as np
import traceback
import urllib
import urllib.request
url = 'https://i.stack.imgur.com/hWf3H.png'
req = urllib.request.urlopen(url)
arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
img = cv2.imdecode(arr, -1) # 'Load it as it is'
print("pytesseract.__version__=", pytesseract.__version__)
plt.subplot(1, 5, 1), plt.imshow(img)
plt.show()
# sourceFileName = r'/home/user/Screenshot from 2022-12-11 22-40-48.png'
# img = cv2.imread(sourceFileName, cv2.IMREAD_UNCHANGED) # https://docs.opencv.org/3.4/d8/d6a/group__imgcodecs__flags.html
ocrResultString = ''
for oem in range(0, 4):
    for psm in range(0, 14):
        try:
            inputParameter = 'psm: ' + str(psm) + ' oem: ' + str(oem)
            print("inputParameter=", inputParameter)
            config = '--psm ' + str(psm) + ' --oem ' + str(oem) + ' -c tessedit_char_whitelist=0123456789'
            ocrResultString = pytesseract.image_to_string(img, config=config)
            print("ocrResultString=\n", ocrResultString.strip())
        except Exception as e:
            print(traceback.format_exc())
            print('e', e)
        if '4' in ocrResultString.strip():
            raise ValueError('SUCCESS ------- ' + ocrResultString + ' ' + config)
Solution
I think this would work, if you did a little image pre-processing on it first. A large part of this image is irrelevant to the problem, and if you can tell tesseract that, it's going to have a much easier time reading this image.
Specifically, you can identify the large black region, and set it to white, so that tesseract can tell that the white background and the black background are all part of the background.
Code:
from PIL import Image
import scipy.ndimage
import numpy as np
import pytesseract
def filter_image(img):
    """Filter large black regions from img"""
    img = np.asarray(img)
    # Convert to grayscale
    img = img.mean(axis=2)
    # Invert colors, black=white, white=black
    img = 255 - img
    # Convert to 1s and 0s
    img = img > 127
    img = filter_connected_components(img)
    img = (img > 0) * 255
    # Invert colors back
    img = 255 - img
    # Set numpy array to unsigned
    img = img.astype('uint8')
    img = Image.fromarray(img)
    return img

def filter_connected_components(img, component_threshold_px=1000):
    """Filter regions of 1 values in `img`, where those regions are
    larger than `component_threshold_px` in area. Replace those regions
    with 0."""
    labels, num_groups = scipy.ndimage.label(img)
    for i in range(1, num_groups + 1):
        size_of_component = (labels == i).sum()
        if size_of_component > component_threshold_px:
            labels[labels == i] = 0
    return labels
img = Image.open('test216_img.png')
img = filter_image(img)
print(pytesseract.image_to_string(img, config='--psm 10'))
This is what the provided image looks like after pre-processing:
Tesseract still can't read this on default settings. In order to read this, you'll need to use PSM 10, which corresponds to "Treat the image as a single character."
print(pytesseract.image_to_string(img, config='--psm 10'))
Which prints:
4
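To see what the connected-component filter does without loading an image, here is a small dependency-free sketch on a toy grid. It is not part of the original answer: the helper names `label_components` and `drop_large_components`, the 7x7 grid, and the threshold of 10 are all made up for illustration. The grid stands for the image after inversion and thresholding, so 1 marks a dark pixel. Regions are joined through edge-adjacent neighbours only, which is also what scipy.ndimage.label does by default for 2D input.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected regions of 1s. Returns (labels, sizes), where
    sizes maps each label to its pixel count. Hypothetical helper."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    sizes = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or labels[r][c]:
                continue  # background pixel, or already labelled
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            size = 0
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            sizes[next_label] = size
    return labels, sizes


def drop_large_components(grid, threshold):
    """Keep only regions with at most `threshold` pixels."""
    labels, sizes = label_components(grid)
    return [[1 if lab and sizes[lab] <= threshold else 0 for lab in row]
            for row in labels]


# Toy input: a thick dark frame (24 px) around a small "digit" stroke (6 px).
grid = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]

_, sizes = label_components(grid)
cleaned = drop_large_components(grid, threshold=10)

print(sorted(sizes.values()))  # component sizes, useful for picking a threshold
for row in cleaned:
    print(row)
```

Printing the component sizes first is a simple way to choose a threshold for a real image: the value only needs to fall between the size of the digit's strokes and the size of the black region to be removed.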
Answered By - Nick ODell