Issue
I started an OCR project a few days ago. The input image is a very noisy grayscale image with white letters. With the EAST text detector I can detect the text and draw bounding boxes around it. I then crop each rectangle and apply some image processing. Finally, I pass the processed parts to pytesseract, but the results are bad. Images and source code are below. Maybe someone has a good idea for better image processing and/or pytesseract settings.
Images
Input image
Rectangles after Recognition
First part
Second part
Third part
Tesseract result: AY U N74 O54
Source code for image processing
kernel = cv2.getStructuringElement(cv2.MORPH_RECT , (8,8))
kernel2 = np.ones((3,3),np.uint8)
kernel3 = np.ones((5,5),np.uint8)
gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, None, fx=7, fy=7)
gray = cv2.GaussianBlur(gray, (5,5), 1)
#cv2.medianBlur(gray, 5)
gray = cv2.dilate(gray, kernel3, iterations = 1)
gray = cv2.erode(gray, kernel3, iterations = 1)
gray = cv2.morphologyEx(gray, cv2.MORPH_DILATE, kernel3)
gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
gray = cv2.bitwise_not(gray)
ts_img = Image.fromarray(gray)
txt = pytesseract.image_to_string(ts_img, config='--oem 3 --psm 12 -c tessedit_char_whitelist=12345678ABCDEFGHIJKLMNOPQRSTUVWXYZ load_system_dawg=false load_freq_dawg=false')
I tried some other psm settings, such as psm 11, psm 8, and psm 6. The results differ, but they are also bad. I guess the biggest problem is the black spots that are connected to the letters and digits, but I have no idea how to remove them. I appreciate any help :)
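One detail in the config string above may be worth checking before changing the image processing. As far as I know, the Tesseract command line expects a separate -c flag for each variable, so the bare load_system_dawg=false and load_freq_dawg=false are probably not applied as written. The whitelist also leaves out 0 and 9, so those digits can never be returned. Below is a minimal sketch of building the string; build_tesseract_config is a hypothetical helper (not part of pytesseract), and psm 6 is only one of the modes mentioned above, not a recommendation.

```python
import string


def build_tesseract_config(psm=6, oem=3, whitelist=None, variables=None):
    """Build a pytesseract config string (hypothetical helper).

    Every Tesseract variable is emitted with its own -c flag.
    """
    if whitelist is None:
        # All digits and uppercase letters, including 0 and 9
        whitelist = string.digits + string.ascii_uppercase
    parts = [
        f"--oem {oem}",
        f"--psm {psm}",
        f"-c tessedit_char_whitelist={whitelist}",
    ]
    for name, value in (variables or {}).items():
        parts.append(f"-c {name}={value}")
    return " ".join(parts)


config = build_tesseract_config(
    psm=6,
    variables={"load_system_dawg": "false", "load_freq_dawg": "false"},
)
print(config)
# Assumed usage with the variables from the snippet above:
# txt = pytesseract.image_to_string(ts_img, config=config)
```

Whether this changes the recognition quality on such a noisy image is an open question; it only makes sure the intended settings reach Tesseract.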
Solution
OCR software will perform poorly when interpreting this text as a word or sentence because it's expecting real English words and not a random combination of characters. I'd recommend analyzing the text as individual characters. I solved the (example) problem by first determining which groups of labeled pixels (connected components of a thresholded image) are characters based on the size and location of the group. Then, for each image portion containing a single character, I use easyocr to obtain the character. I found that pytesseract performs poorly or not at all on single characters (even when setting --psm 10 and other arguments). The code below produces this result:
OCR out: 6UAE005X0721295
import cv2
import matplotlib.pyplot as plt
import numpy as np
import easyocr
reader = easyocr.Reader(["en"])
# Threshold image and determine connected components
img_bgr = cv2.imread("C5U3m.png")
img_gray = cv2.cvtColor(img_bgr[35:115, 30:], cv2.COLOR_BGR2GRAY)
ret, img_bin = cv2.threshold(img_gray, 195, 255, cv2.THRESH_BINARY_INV)
retval, labels = cv2.connectedComponents(255 - img_bin, np.zeros_like(img_bin), 8)
fig, axs = plt.subplots(4)
axs[0].imshow(img_gray, cmap="gray")
axs[0].set_title("grayscale")
axs[1].imshow(img_bin, cmap="gray")
axs[1].set_title("thresholded")
axs[2].imshow(labels, vmin=0, vmax=retval - 1, cmap="tab20b")
axs[2].set_title("connected components")
# Find and process individual characters
OCR_out = ""
all_img_chars = np.zeros((labels.shape[0], 0), dtype=np.uint8)
labels_xmin = [np.argwhere(labels == i)[:, 1].min() for i in range(0, retval)]
# Process the labels (connected components) from left to right
for i in np.argsort(labels_xmin):
    label_yx = np.argwhere(labels == i)
    label_ymin = label_yx[:, 0].min()
    label_ymax = label_yx[:, 0].max()
    label_xmin = label_yx[:, 1].min()
    label_xmax = label_yx[:, 1].max()
    # Characters are large blobs that don't border the top/bottom edge
    if label_yx.shape[0] > 250 and label_ymin > 0 and label_ymax < labels.shape[0] - 1:
        # Crop with a 3-pixel margin, clamped so the start index cannot go negative
        img_char = img_bin[:, max(label_xmin - 3, 0) : label_xmax + 3]
        all_img_chars = np.hstack((all_img_chars, img_char))
        # Use EasyOCR on single char (pytesseract performs poorly on single characters)
        OCR_out += reader.recognize(img_char, detail=0)[0]
axs[3].imshow(all_img_chars, cmap="gray")
axs[3].set_title("individual characters")
fig.show()
print("Truth: 6UAE005X0721295")
print("OCR out: " + OCR_out)
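As a possible variation on the loop above, cv2.connectedComponentsWithStats returns each component's bounding box and pixel area directly, which avoids scanning the label image with np.argwhere once per label. The sketch below is an alternative under stated assumptions, not the code that produced the result above: select_character_boxes and boxes_from_binary are hypothetical helper names, the 250-pixel area threshold is carried over from the code above, and the stats rows in the example are made up for illustration.

```python
def select_character_boxes(stats, image_height, min_area=250):
    """Return (left, right) column bounds of likely characters, left to right.

    `stats` rows follow OpenCV's column order: left, top, width, height, area.
    """
    boxes = []
    for left, top, width, height, area in stats:
        bottom = top + height - 1
        touches_edge = top == 0 or bottom == image_height - 1
        # Keep large blobs that do not border the top/bottom edge
        if area > min_area and not touches_edge:
            boxes.append((int(left), int(left + width - 1)))
    return sorted(boxes)


def boxes_from_binary(img_fg_white):
    """Label white foreground pixels of a uint8 image and filter the blobs."""
    import cv2  # imported here so the filter above stays usable without OpenCV

    _, _, stats, _ = cv2.connectedComponentsWithStats(img_fg_white, connectivity=8)
    return select_character_boxes(stats, img_fg_white.shape[0])


# Made-up stats for an 80-pixel-high image
example_stats = [
    (0, 0, 200, 80, 14000),  # background, touches the top edge
    (120, 20, 30, 40, 900),  # character-sized blob
    (90, 30, 4, 4, 16),      # noise speck, too small
    (40, 20, 30, 40, 800),   # character-sized blob
]
print(select_character_boxes(example_stats, image_height=80))
```

With the variables from the code above, the call would presumably be boxes_from_binary(255 - img_bin), and each (left, right) pair could then be used to slice img_bin before passing the crop to reader.recognize.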
Answered By - Bart van Otterdijk