Issue
I've been working on a program to read financial statements using optical character recognition, and for the life of me I can't figure out why some numbers are still not being read by the open source module I'm using.
I created an output file with green boxes around the original input where text is being detected. In this case, the line with "381" is picked up, but the line below (which has the same exact format) is being ignored.
I'm using this code to preprocess the image before extracting the data; it brought the miss rate down from as high as 20% to closer to 5%.
import cv2
import numpy as np

img = cv2.imread(filename)
# Upscale slightly so small glyphs have more pixels to work with
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Dilate then erode to clean up noise; note that with a 1x1 kernel
# both operations are effectively no-ops
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
After this preprocessing I also run an algorithm to remove solid lines past a certain size from the document, but in this case neither "35" nor "381" is underlined in the original, so I doubt this is causing the issue. I've also verified that the top part of the 5 isn't being removed by the line detection algorithm.
I'm no expert in OCR or CV; my specialty is more data and general-purpose programming. I really just need to get this library to do the job it's advertised to do so that I can move on and finish the program. Does anyone have an idea what could be causing this issue?
Solution
I would suggest setting your configuration to a specific page segmentation mode (PSM), such as 11, since you're looking for sparse text. For example, I have in my code:
from PIL import Image
import pytesseract
from pytesseract import Output

results = pytesseract.image_to_data(Image.open(tempFile), lang='eng', config='--psm 11', output_type=Output.DICT)
The PSM values are as follows:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
There's also a way to restrict the search to numbers instead of all text, which may help as well.
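For the numbers-only route, one option (assuming you're using pytesseract, as the image_to_data call above is) is to give Tesseract a digit whitelist, e.g. config='--psm 11 -c tessedit_char_whitelist=0123456789'. Another is to filter the DICT output after the fact. A minimal sketch of that filtering, using a hand-made results dict shaped like what image_to_data returns (the sample values here are hypothetical):

```python
def numeric_words(results, min_conf=0):
    """Keep only tokens that parse as numbers, with confidence and bounding box."""
    hits = []
    for i, text in enumerate(results["text"]):
        token = text.strip().replace(",", "")
        conf = int(results["conf"][i])
        if not token or conf < min_conf:
            continue
        try:
            float(token)  # keep only tokens that parse as a number
        except ValueError:
            continue
        hits.append((token, conf,
                     (results["left"][i], results["top"][i],
                      results["width"][i], results["height"][i])))
    return hits

# Example with a hand-made dict in image_to_data's DICT layout
sample = {
    "text":   ["Total", "381", "", "35"],
    "conf":   ["95",    "96",  "-1", "88"],
    "left":   [10, 120, 0, 120],
    "top":    [5, 5, 0, 30],
    "width":  [50, 30, 0, 20],
    "height": [12, 12, 0, 12],
}
print(numeric_words(sample))  # keeps only the "381" and "35" entries
```

Filtering afterward keeps the full OCR context available, while the whitelist changes what Tesseract will even consider during recognition; the two approaches can also be combined.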
Answered By - MontX