Wednesday, November 22, 2023

[FIXED] pytesseract 5.0.0 returns nonsense results for mixed numbers and letters

Labels: ocr, python, python-tesseract, tesseract

Issue

The Problem:

I would like to extract the text, which is a mix of letters and numbers, from images such as these:

As can be seen, the images may be in various orientations, and sometimes they contain noise, like the first one with its white circles. But the text always starts with the letters 'BF', followed by 10 digits. I think this should be easily feasible with Tesseract, yet somehow it does not work.

What I have tried so far. First, the Tesseract version reported by pytesseract, as it seems to be important from what I have found (with Python 3.7.3):

import pytesseract
pytesseract.get_tesseract_version()
'5.0.0-alpha.20190708'

Following this answer and this one, I have tried configs that are supposed to work with mixed numbers and letters, like the one below:

from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('image.jpg'), config='tessedit_char_whitelist=01234ABCDEF'))

But the results are:

First image: 'SALT LB:\n\nbe) be)'
Second image: ''
Third image: 'OS26S0S061 38'

These are horrible. I have tried various combinations of the config, but nothing works. I can also confirm that the text is easily extracted by the free online demos of services such as Azure Cognitive Services, so the images themselves are not the problem. I think I am either struggling to find the right config for pytesseract, or the latest version has bugs.

Solution

pytesseract is a wrapper for Tesseract.

From the Tesseract documentation:

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn’t good enough, which can result in a significant reduction in accuracy.

You can see how Tesseract has processed the image by using the configuration variable tessedit_write_images to true (or using configfile get.images) when running Tesseract. If the resulting tessinput.tif file looks problematic, try some of these image processing operations before passing the image to Tesseract.

So the images need to have the correct orientation (not rotated).

The following Python code takes a rotation-fixed sample of your images, "7YpyGm_rotation_fixed.jpg", converts it to grayscale, then binarizes and inverts it before applying OCR.

It uses OpenCV for image manipulation, which you can install with:
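If the rotation is not known in advance, one option is to try each quarter-turn and keep the first result that looks like a valid label. The following is only a minimal sketch, not part of the original answer: the helper names read_any_orientation and read_file are hypothetical, the pattern comes from the question's statement that the text is 'BF' followed by 10 digits, and it only covers multiples of 90 degrees, so skewed images would need a proper deskew step instead.

```python
import re

# Assumption taken from the question: the label is 'BF' followed by 10 digits.
LABEL_RE = re.compile(r"BF\s*\d{10}")


def read_any_orientation(image, rotate, ocr, turns=(0, 1, 2, 3)):
    """Try each quarter-turn; return (turns, text) for the first OCR
    result that contains a label, or None if no orientation matches."""
    for k in turns:
        text = ocr(rotate(image, k))
        if LABEL_RE.search(text):
            return k, text
    return None


def read_file(path):
    """Hypothetical convenience wrapper wiring in OpenCV and pytesseract."""
    # Imported here so the helper above stays free of heavy dependencies.
    import cv2
    import pytesseract

    codes = {
        1: cv2.ROTATE_90_CLOCKWISE,
        2: cv2.ROTATE_180,
        3: cv2.ROTATE_90_COUNTERCLOCKWISE,
    }

    def rotate(img, k):
        return img if k == 0 else cv2.rotate(img, codes[k])

    # Same preprocessing as in the answer: grayscale, then inverted threshold.
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    _, img_bin = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)
    return read_any_orientation(img_bin, rotate, pytesseract.image_to_string)
```

Passing the rotate and ocr steps in as functions keeps the search loop independent of any particular imaging library; read_file("image.jpg") shows one way to plug in cv2.rotate and pytesseract.image_to_string.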

pip install opencv-contrib-python

import cv2
import pytesseract

# read image
img = cv2.imread("7YpyGm_rotation_fixed.jpg")
# convert to grayscale
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# binarize with threshold and invert image
_, img_bin = cv2.threshold(img, 60, 255, cv2.THRESH_BINARY_INV)

# run OCR on the preprocessed image and show the raw string
text = pytesseract.image_to_string(img_bin)
print(repr(text))

The result is 'BF 1905059748 ‘\n'
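The raw string still contains a space, a stray quote character and a newline. Since the question states that the text is always 'BF' followed by 10 digits, a regular expression can pull out just the label. This is a small sketch under that assumption; the name extract_label is hypothetical and not part of pytesseract.

```python
import re

# Assumption from the question: the label is 'BF' followed by exactly 10 digits.
# Optional whitespace between the letters and the digits is tolerated.
LABEL_RE = re.compile(r"BF\s*(\d{10})(?!\d)")


def extract_label(ocr_text):
    """Return the label as 'BF' + 10 digits, or None if the text has no match."""
    match = LABEL_RE.search(ocr_text)
    return "BF" + match.group(1) if match else None


print(extract_label("BF 1905059748 ‘\n"))  # BF1905059748
```

A None result is also a cheap signal that the OCR pass failed and the preprocessing or orientation needs another look.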

Answered By - Javi12

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
