Saturday, September 3, 2022

[FIXED] Why can't tesseract extract text that has a black background?

September 03, 2022 ocr, opencv, python, python-tesseract No comments

Issue

I have attached a very simple text image that I want text from. It is white with a black background. To the naked eye it seems absolutely legible but apparently to tesseract it is a rubbish. I have tried changing the oem and psm parameters but nothing seems to work. Please note that this works for other images but for this one.

Please try running it on your machine and see if it works. Or else I might have to change my ocr engine altogether.

Note: It was working earlier until I tried to add black pixels around the image to help the extraction process. Also I don't think that tesseract was trained on black text on a white background. It should be able to do this too. Also if this was true why does it work for other text images that have the same format as this one

Edit: Miraculously I tried running the script again and this time it was able to extract Chand properly but failed in the below mentioned case. Also please look at the parameters I have used. I have read the documentation and I feel this would be the right choice. I have added the image for your reference. It is not about just this image. Why is tesseract failing for such simple use cases?

Solution

To find the desired result, you need to know the followings:

The input images are boldly written, we need to shrink the bold font and then assume the output as a single uniform block of text.

To shrink the images we could use erosion

Result will be:

Erode	Result
	`CHAND`
	`BAKLIWAL`

Code:

# Load the library
import cv2
import pytesseract

# Initialize the list
img_lst = ["lKpdZ.png", "ZbDao.png"]

# For each image name in the list
for name in img_lst:

    # Load the image
    img = cv2.imread(name)

    # Convert to gry-scale
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Erode the image
    erd = cv2.erode(gry, None, iterations=2)

    # OCR with assuming the image as a single uniform block of text
    txt = pytesseract.image_to_string(erd, config="--psm 6")
    print(txt)

Answered By - Ahx

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Saturday, September 3, 2022

[FIXED] Why can't tesseract extract text that has a black background?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels