Saturday, March 26, 2022

[FIXED] Pytesseract not detecting a digit which might be a picture within a picture

Labels: ocr, python, python-tesseract

Issue

I'm trying to extract the number from the image strip given below:

I have no problem extracting digits from normal text, but the digit in the above strip seems to be a picture within a picture. This is the code I'm using to extract the digit:

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ\65.png")
text = pytesseract.image_to_string(img, config='--psm 6')
# Save the OCR output; the with-block closes the file automatically.
with open("c.txt", 'w') as file:
    file.write(text)
print(text)

I've tried every page segmentation mode (psm) from 1 to 13, and they all output just "week". The code works if I crop the image down to just the digit, but my project requires me to extract it from a strip like this one. Could someone please help me? I've been stuck on this part of my project for some time now.

I've attached the complete image in case it would help anyone understand the problem better.

I can extract the digits in the text on the right, but I am not able to extract the one in the leftmost week strip!

Solution

First, apply adaptive thresholding to the image, followed by a bitwise-not operation.

After adaptive-thresholding:

After bitwise-not:

To know more about those operations, you can look at Morphological Transformations, Arithmetic Operations and Image Thresholding.
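To make the two operations concrete, here is a minimal pure-Python sketch of mean adaptive thresholding with inverted output, followed by a bitwise-not. It is a simplified illustration of the idea, not OpenCV's implementation: the function names, the 3x3 block size and the 5x5 patch are assumptions chosen for readability.

```python
def adaptive_mean_threshold_inv(gray, block_size=3, c=2, max_value=255):
    """Simplified mean adaptive threshold with inverted binary output.

    gray is a list of rows of 0-255 integers. A pixel becomes max_value when
    it is NOT brighter than (local mean - c), otherwise it becomes 0.
    """
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp coordinates so border pixels reuse their nearest neighbour.
                    ny = min(max(y + dy, 0), height - 1)
                    nx = min(max(x + dx, 0), width - 1)
                    total += gray[ny][nx]
            threshold = total / (block_size * block_size) - c
            row.append(0 if gray[y][x] > threshold else max_value)
        out.append(row)
    return out


def bitwise_not(image):
    """Invert every 8-bit pixel: 0 becomes 255 and 255 becomes 0."""
    return [[255 - value for value in row] for row in image]


# Hypothetical 5x5 grayscale patch: one dark pixel (a "stroke") on a light background.
patch = [[200] * 5 for _ in range(5)]
patch[2][2] = 40

thr = adaptive_mean_threshold_inv(patch)  # stroke is white, background is black
bnt = bitwise_not(thr)                    # stroke is black, background is white

for row in bnt:
    print(row)
```

In this sketch the stroke ends up black on a white background after the bitwise-not, which is generally the polarity Tesseract handles best.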

Now we need to read the image column by column.

To do that, we set page segmentation mode 4:

"4: Assume a single column of text of variable sizes." (from Tesseract's list of page segmentation modes)

Now when we read:

txt = pytesseract.image_to_string(bnt, config="--psm 4")

Result:

WEEK © +4 hours te complete

5 Software

in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the
.
.
.

That is a lot of information; we want only the values 5 and 6.

The logic is: if the string WEEK appears in the current line, take the next non-empty line and print it:

txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True

Result:

5 Software
: 6 Troubleshooting
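The same logic can be wrapped in a small function so it can be tried without running OCR. This is only a sketch: the function name lines_after_marker is hypothetical, and the sample string is an abbreviated stand-in for the OCR output (the content of the second WEEK line is assumed).

```python
def lines_after_marker(ocr_text, marker="WEEK"):
    """Return the first non-empty line following each line that contains marker."""
    found = []
    take_next = False
    for line in ocr_text.strip().split("\n"):
        if line and take_next:
            found.append(line)
            take_next = False
        if marker in line:
            take_next = True
    return found


# Abbreviated stand-in for the OCR output; the second "WEEK" line is assumed.
sample = (
    "WEEK © +4 hours te complete\n"
    "\n"
    "5 Software\n"
    "\n"
    "in the fifth week af this course, we'll learn about tcomputer software.\n"
    "\n"
    "WEEK\n"
    "\n"
    ": 6 Troubleshooting\n"
)

print(lines_after_marker(sample))
```

Note that the match is case-sensitive, so the lowercase "week" inside the paragraph text does not trigger the flag.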

Now, to get only the integers, we can use a regular expression:

t = re.sub("[^0-9]", "", t)
print(t)

Result:

5
6
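One thing to keep in mind: removing every non-digit character joins all digits on the line into one string. If a line could contain more than one number, taking the first run of digits may be safer. The sketch below compares the two; first_integer and digits_only are hypothetical helper names, and the line "5 Software 2" is a made-up example, not part of the OCR output.

```python
import re


def digits_only(line):
    """The approach used above: drop every character that is not a digit."""
    return re.sub("[^0-9]", "", line)


def first_integer(line):
    """Alternative: return the first run of digits as an int, or None if absent."""
    match = re.search(r"\d+", line)
    return int(match.group()) if match else None


for line in ["5 Software", ": 6 Troubleshooting", "5 Software 2"]:
    print(repr(line), digits_only(line), first_integer(line))
```

For the two lines in this answer both approaches give 5 and 6; they only differ on lines with several separate numbers.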

Code:

import re
import cv2
import pytesseract

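# Load the image and convert it to grayscale.
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Adaptive (local mean) threshold, then invert the result with bitwise-not.
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
bnt = cv2.bitwise_not(thr)
# psm 4: assume a single column of text of variable sizes.
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    # Print the digits of the first non-empty line after a WEEK line.
    if t and get_nxt_ln:
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True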

Answered By - Ahx

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
