Tuesday, October 4, 2022

[FIXED] Pytesseract - Not Detecting simple black text on white background

Labels: ocr, python, python-tesseract

Issue

I am cropping sections from a larger image to be scanned by OCR. The first of two cropped sections is detected ok. Here is a saved jpeg of the first section:

I have this other cropped section which pytesseract is absolutely clueless about:

I use the same code to scan the images:

from PIL import Image
import pytesseract


def get_crop_as_text(page, left, upper, right, lower, debug_out_nm=''):
    # Crop the region of interest out of the full page image
    img = page.crop((left, upper, right, lower))
    # img.save('test_crop' + debug_out_nm + '.jpg', 'JPEG')  # optional debug output
    txt = pytesseract.image_to_string(img)
    return txt.replace('\n', '')


im = Image.open(dat_file)  # dat_file: path to the scanned page image
id = get_crop_as_text(im, 785, 486, 1492, 589, '_id_')
rrg = get_crop_as_text(im, 1372, 3791, 1482, 3853, '_rrg_')

'id' returns '1001' as expected. The second, 'rrg', returns '' (an empty string).

I have also saved the crops locally and scanned each saved file individually. In that case, the '-2.0' is detected only sometimes, even though it is the same file and the same method. It is hit or miss, and I can't figure out why.

Solution

A few notes on what finally worked:

Switched to easyocr.
The latest version of OpenCV caused problems with easyocr, so I had to downgrade OpenCV to version 4.5.4.60.
Converting the image to a NumPy array did not help with detecting the minus sign. Instead, I had to save the crop to a temporary file and then run the OCR on that file.

import easyocr


def get_crop_as_text(page, left, upper, right, lower, debug_out_nm=''):
    CROP_FILE = 'crop.jpg'
    # Building the Reader loads the model, so reuse one instance if this is called often
    reader = easyocr.Reader(['en'], gpu=False)
    txt = ''
    try:
        img = page.crop((left, upper, right, lower))
        width, height = img.size
        # img = img.resize((width*10, height*10))
        # Run the OCR on a saved file rather than on an in-memory array
        img.save(CROP_FILE, 'JPEG')
        result = reader.readtext(CROP_FILE)
        # Each result entry is (bounding box, text, confidence); keep the first text
        txt = result[0][1]
        txt = txt.replace('\n', '')
    except Exception as e:
        # Also reached when nothing is detected (result is empty)
        print(e)
    return txt
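
The function above writes every crop to a fixed file name and keeps only the first detection. A possible variation, sketched here and not taken from the original answer, uses a uniquely named temporary file that is cleaned up afterwards and joins all detections instead of indexing result[0]. The helper names ocr_via_temp_file and join_detections are hypothetical, the PNG suffix is an assumption (the answer saved JPEG), and read_fn stands in for something like reader.readtext.

```python
import os
import tempfile


def ocr_via_temp_file(img, read_fn, suffix=".png"):
    """Save img to a temporary file, run read_fn on its path, then delete the file.

    img is assumed to expose a PIL-style save(path) method; read_fn is any
    callable that takes a file path (for example an easyocr reader's readtext).
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the image library reopens the file
    try:
        img.save(path)
        return read_fn(path)
    finally:
        os.remove(path)


def join_detections(result):
    """Join the text of every (bounding box, text, confidence) entry.

    Returns an empty string when nothing was detected, so no IndexError is raised.
    """
    return " ".join(text for _bbox, text, _conf in result).replace("\n", "")
```

Inside get_crop_as_text this might be used as txt = join_detections(ocr_via_temp_file(img, reader.readtext)), which would also make the broad except block unnecessary for the "nothing detected" case.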

Answered By - Michael James

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
