Wednesday, November 17, 2021

[FIXED] how to improve pytesseract arguments to work properly

November 17, 2021 python, python-tesseract No comments

Issue

I would like to read this captcha using pytesseract:

I follow the advice here: Use pytesseract OCR to recognize text from an image

My code is:

import pytesseract
import cv2

def captcha_to_string(picture):
    image = cv2.imread(picture)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening

    cv2.imwrite('thresh.jpg', thresh)
    cv2.imwrite('opening.jpg', opening)
    cv2.imwrite('invert.jpg', invert)

    # Perform text extraction
    text = pytesseract.image_to_string(invert, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
    return text

But my code returns 8\n\x0c which is nonsense.

This is how thresh looks like:

This is how opening looks like:

This is how invert looks like:

Can you help me, how can I improve captcha_to_string function to read the captcha properly? Thanks a lot.

Solution

You are on the right way. Removing the noise (small black spots in the inverted image) looks like the way to extract the text successfully.

FYI, the configuration of pytessearct makes the outcome worse only. So, I removed it.

My approach is as follows:

import pytesseract
import cv2
import matplotlib.pyplot as plt
import numpy as np 

def remove_noise(img,threshold):
    """
    remove salt-and-pepper noise in a binary image
    """
    filtered_img = np.zeros_like(img)
    labels,stats= cv2.connectedComponentsWithStats(img.astype(np.uint8),connectivity=8)[1:3]

    label_areas = stats[1:, cv2.CC_STAT_AREA]
    for i,label_area in enumerate(label_areas):
        if label_area > threshold:
            filtered_img[labels==i+1] = 1
    return filtered_img


def preprocess(img_path):
    """
    convert the grayscale captcha image to a clean binary image
    """
    img = cv2.imread(img_path,0)
    blur = cv2.GaussianBlur(img, (3,3), 0)

    thresh = cv2.threshold(blur, 150, 255, cv2.THRESH_BINARY_INV)[1]

    filtered_img = 255-remove_noise(thresh,20)*255

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    erosion = cv2.erode(filtered_img,kernel,iterations = 1)
    return erosion

def extract_letters(img):
    text = pytesseract.image_to_string(img)#,config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

    return text


img = preprocess('captcha.jpg')

text=extract_letters(img)
print(text)

plt.imshow(img,'gray')
plt.show()

This is the processed image.

And, the script returns 18L9R.

Answered By - Prefect

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, November 17, 2021

[FIXED] how to improve pytesseract arguments to work properly

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels