Issue
I'm trying to extract the number from the image strip shown below.
![](https://i.stack.imgur.com/hy3aJ.png)
I have no problem extracting digits from normal text, but the digit in the strip above seems to be a picture within a picture. This is the code I'm using to extract the digit:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ\65.png")
text = pytesseract.image_to_string(img, config='--psm 6')
with open("c.txt", 'w') as file:
    file.write(text)
print(text)
I've tried every psm value from 1 to 13, and they all output just "week". The code works if I crop out just the digit, but my project requires me to extract it from a strip like this one. Could someone please help me? I've been stuck on this part of my project for some time now.
I've attached the complete image in case it would help anyone understand the problem better.
![](https://i.stack.imgur.com/BWSFU.jpg)
I can extract the digits in the text on the right, but I am not able to extract the one in the leftmost week strip!
Solution
First you need to apply adaptive-thresholding followed by a bitwise-not operation to the image.
[Image: result after adaptive-thresholding]
[Image: result after bitwise-not]
To learn more about these operations, see Morphological Transformations, Arithmetic Operations and Image Thresholding in the OpenCV documentation.
Now we need to read the text column by column.
To do that, we set page segmentation mode 4:
"4: Assume a single column of text of variable sizes." (from Tesseract's list of page segmentation modes)
Now when we read:
txt = pytesseract.image_to_string(bnt, config="--psm 4")
Result:
WEEK © +4 hours te complete
5 Software
in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the
.
.
.
That is a lot of text, and we only want the values 5 and 6.
The logic is: if the string WEEK appears in the current line, take the next non-empty line and print it:
txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True
Result:
5 Software
: 6 Troubleshooting
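The same "line after WEEK" logic can be wrapped in a small function so that it can be tried without running OCR. This is only a sketch: the name `lines_after_marker` and the `sample` text are made up for illustration, with the sample loosely modeled on the OCR output shown above.

```python
def lines_after_marker(ocr_text, marker="WEEK"):
    """Return the first non-empty line following each line that contains marker."""
    found = []
    take_next = False
    for line in ocr_text.strip().split("\n"):
        if line and take_next:
            found.append(line)
            take_next = False
        if marker in line:
            take_next = True
    return found


# Hypothetical OCR text, including blank lines between a marker and its value.
sample = (
    "WEEK © +4 hours te complete\n"
    "5 Software\n"
    "in the fifth week af this course\n"
    "\n"
    "WEEK\n"
    "\n"
    ": 6 Troubleshooting\n"
)
print(lines_after_marker(sample))
```

Because the check is case-sensitive, the lowercase "week" inside the description text is not treated as a marker.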
Now, to keep only the digits, we can use a regular expression:
t = re.sub("[^0-9]", "", t)
print(t)
Result:
5
6
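One thing to be aware of: removing every non-digit character joins all the digits on a line, so a line such as "Week 12 of 20" would become "1220". If that matters for your images, a possible alternative is to take only the first run of digits. The helper name `week_number` below is hypothetical, and this is a sketch rather than a replacement for the approach above.

```python
import re


def week_number(line):
    """Return the first run of digits in line as an int, or None if there is none."""
    match = re.search(r"\d+", line)
    return int(match.group()) if match else None


for line in ["5 Software", ": 6 Troubleshooting", "Week 12 of 20", "no digits"]:
    print(line, "->", week_number(line))
```

Returning `None` for lines without digits also makes it easy to detect a failed read instead of printing an empty string.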
Code:
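import re
import cv2
import pytesseract

# Load the screenshot and convert it to grayscale
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Inverted adaptive threshold, then bitwise-not to flip it back
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
bnt = cv2.bitwise_not(thr)

# OCR with page segmentation mode 4 (single column of text)
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")

get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        # Keep only the digits of the line that follows a WEEK line
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True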
Answered By - Ahx