Wednesday, November 22, 2023

[FIXED] How to OCR single character with tesseract?

Labels: python, python-tesseract, tesseract

Issue

I want to OCR pictures like the following where the single digit could be contained inside a white space surrounded by black color:

For this I am using pytesseract 0.3.10 (the value of pytesseract.__version__) and the following Tesseract build:

tesseract --version
tesseract 4.0.0-beta.1
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.38 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE

Looping over every combination of Tesseract's --oem and --psm options does not recognize the 4 in the image. See the script below, which never raises the ValueError that would signal success.

How can I get the digit recognized?

import pytesseract
import matplotlib

matplotlib.rcParams['figure.figsize'] = (20.0, 22.0)
from matplotlib import pyplot as plt

import cv2
import numpy as np
import traceback

import urllib
import urllib.request

url = 'https://i.stack.imgur.com/hWf3H.png'
req = urllib.request.urlopen(url)
arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
img = cv2.imdecode(arr, -1)  # 'Load it as it is'

print("pytesseract.__version__=", pytesseract.__version__)

plt.subplot(1, 5, 1), plt.imshow(img)
plt.show()

# sourceFileName = r'/home/user/Screenshot from 2022-12-11 22-40-48.png'
# img = cv2.imread(sourceFileName, cv2.IMREAD_UNCHANGED)  # https://docs.opencv.org/3.4/d8/d6a/group__imgcodecs__flags.html

for oem in range(0, 4):
    for psm in range(0, 14):
        # Reset per attempt so a failed call cannot reuse an earlier result
        ocrResultString = ''
        inputParameter = 'psm: ' + str(psm) + ' oem: ' + str(oem)
        print("inputParameter=", inputParameter)
        config = '--psm ' + str(psm) + ' --oem ' + str(oem) + ' -c tessedit_char_whitelist=0123456789'
        try:
            ocrResultString = pytesseract.image_to_string(img, config=config)
            print("ocrResultString=\n", ocrResultString.strip())
        except Exception as e:
            # Log the failure and move on to the next oem/psm combination
            print(traceback.format_exc())
            print('e', e)
        if '4' in ocrResultString.strip():
            raise ValueError('SUCCESS ------- ' + ocrResultString + ' ' + config)

Solution

I think this would work if you pre-processed the image first. A large part of this image is irrelevant to the problem, and if you can tell Tesseract that, it will have a much easier time reading the digit.

Specifically, you can identify the large black region and set it to white, so that Tesseract treats the white area and the formerly black area as one background.
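To make the idea concrete before the full code, here is a minimal, dependency-free sketch of connected-component filtering on a toy grid. It is not the answer's implementation (which relies on scipy.ndimage.label); the function name `remove_large_components`, the `max_area` parameter, and the toy grid are all hypothetical. In the grid, 1 stands for a dark pixel: the outer frame plays the role of the large black region and the small inner blob plays the role of the digit.

```python
from collections import deque


def remove_large_components(grid, max_area):
    """Return a copy of `grid` in which every 4-connected region of 1s
    whose area exceeds `max_area` has been reset to 0."""
    height, width = len(grid), len(grid[0])
    result = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]
    for start_y in range(height):
        for start_x in range(width):
            if grid[start_y][start_x] != 1 or seen[start_y][start_x]:
                continue
            # Breadth-first flood fill collects one connected region.
            region = []
            queue = deque([(start_y, start_x)])
            seen[start_y][start_x] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Large regions are treated as background and erased.
            if len(region) > max_area:
                for y, x in region:
                    result[y][x] = 0
    return result


# Hypothetical toy image: a 24-pixel frame around a 6-pixel blob.
toy = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
]

cleaned = remove_large_components(toy, max_area=10)
for row in cleaned:
    print(row)
```

With `max_area=10` the frame is erased and only the small blob survives, which is the effect the pre-processing aims for on the real image. The right threshold depends on the image size and on how large the digit is.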

Code:

from PIL import Image
import scipy.ndimage
import numpy as np
import pytesseract


def filter_image(img):
    """Filter large black regions from img"""
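    img = np.asarray(img, dtype=float)
    # Convert to grayscale: average the color channels, ignoring a fourth
    # (alpha) channel if present; 2-D grayscale input passes through
    if img.ndim == 3:
        img = img[..., :3].mean(axis=2)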
    # Invert colors, black=white, white=black
    img = 255 - img
    # Convert to 1s and 0s
    img = img > 127

    img = filter_connected_components(img)

    img = (img > 0) * 255
    # Invert colors back
    img = 255 - img
    # Set numpy array to unsigned
    img = img.astype('uint8')
    img = Image.fromarray(img)
    return img

def filter_connected_components(img, component_threshold_px=1000):
    """Filter regions of 1 values in `img`, where those regions are
    larger than `component_threshold_px` in area. Replace those regions
    with 0."""
    labels, num_groups = scipy.ndimage.label(img)
    for i in range(1, num_groups + 1):
        size_of_component = (labels == i).sum()
        if size_of_component > component_threshold_px:
            labels[labels == i] = 0
    return labels

img = Image.open('test216_img.png')
img = filter_image(img)
print(pytesseract.image_to_string(img, config='--psm 10'))

This is what the provided image looks like after pre-processing:

Tesseract still can't read this with its default settings. To read it, you'll need to use page segmentation mode (PSM) 10, which corresponds to "Treat the image as a single character."

print(pytesseract.image_to_string(img, config='--psm 10'))

Which prints:
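The string returned by image_to_string may carry trailing whitespace or other stray characters, so a small post-processing step can help when only the digit is wanted. The sketch below is an illustration, not part of the original answer: `build_config` and `first_digit` are hypothetical helper names. The flags they assemble (--psm, --oem, -c tessedit_char_whitelist=...) are the same ones used in the question's script.

```python
import re


def build_config(psm, oem=None, whitelist=None):
    """Assemble a Tesseract config string from optional pieces."""
    parts = ["--psm", str(psm)]
    if oem is not None:
        parts += ["--oem", str(oem)]
    if whitelist:
        parts += ["-c", "tessedit_char_whitelist=" + whitelist]
    return " ".join(parts)


def first_digit(ocr_text):
    """Return the first ASCII digit found in `ocr_text`, or None."""
    match = re.search(r"[0-9]", ocr_text)
    return match.group(0) if match else None


# Single-character mode, as recommended above.
print(build_config(10))
# The question's full flag set, for comparison.
print(build_config(10, oem=1, whitelist="0123456789"))
```

Assuming pytesseract is installed, the two helpers would combine as first_digit(pytesseract.image_to_string(img, config=build_config(10))), which returns None rather than an empty string when nothing was recognized.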

Answered By - Nick ODell

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
