Issue
I have an image that I want to extract text from using Tesseract and Python. I only want to recognize a certain set of characters, so I pass tessedit_char_whitelist=1234567890CBDE as a config option. However, Tesseract no longer seems to recognize the gaps between the lines. Is there a character I can add to the whitelist so that it recognizes the text as individual pieces again?
Here is the image after the whitelist:
Here is the image before the whitelist:
Here is the code responsible for drawing the boxes and recognizing the characters, in case you're wondering:
import cv2
import pytesseract
from pytesseract import Output

# threshold_img is the preprocessed image created earlier in the script

# configuring parameters for tesseract
# whitelist = "-c tessedit_char_whitelist=1234567890CBDE"
custom_config = r'--oem 3 --psm 6 '
# now feeding image to tesseract
details = pytesseract.image_to_data(threshold_img, output_type=Output.DICT, config=custom_config, lang='eng')
print(details.keys())
total_boxes = len(details['text'])
# minimum confidence for drawing a box; 0 keeps every recognized word
# and skips the structural entries that Tesseract reports with conf -1
CONFIDENCE = 0
for sequence_number in range(total_boxes):
    if int(details['conf'][sequence_number]) >= CONFIDENCE:
        (x, y, w, h) = (details['left'][sequence_number], details['top'][sequence_number], details['width'][sequence_number], details['height'][sequence_number])
        threshold_img = cv2.rectangle(threshold_img, (x, y), (x + w, y + h), (0, 255, 0), 2)
# display image
cv2.imshow('captured text', threshold_img)
cv2.imwrite("before.png", threshold_img)
# Maintain output window until user presses a key
cv2.waitKey(0)
# Destroying present windows on screen
cv2.destroyAllWindows()
EDIT:
Here is the original image I want to extract the text from, with the goal of writing it to a matrix:
The desired matrix would take the following form:
content = [
    ["1C", "55", "55", "E9", "BD"],
    # ...
    ["1C", "1C", "55", "BD", "BD"]
]
Solution
One solution is:
- Individually take each tuple and upsample by 2
- Apply threshold
- Recognize by setting page-segmentation-mode to 6
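The first step, cutting the image into one crop per tuple, could be sketched roughly as follows. The helper name cell_bounds is hypothetical, and it assumes the tuples sit on an evenly spaced 5 x 5 grid that fills the whole image. It uses proportional integer division, so leftover pixels are spread across the cells instead of being dropped at the edge.

```python
def cell_bounds(height, width, rows=5, cols=5):
    """Return a rows x cols nested list of (y0, y1, x0, x1) crop bounds.

    Assumes the cells form an evenly spaced grid covering the whole image.
    """
    grid = []
    for r in range(rows):
        # Top and bottom edge of this row of cells
        y0 = r * height // rows
        y1 = (r + 1) * height // rows
        row = []
        for c in range(cols):
            # Left and right edge of this column of cells
            x0 = c * width // cols
            x1 = (c + 1) * width // cols
            row.append((y0, y1, x0, x1))
        grid.append(row)
    return grid


# Example with a hypothetical 100 x 50 pixel image
bounds = cell_bounds(100, 50)
top_left = bounds[0][0]
bottom_right = bounds[4][4]
```

With a grayscale NumPy image gry, a single cell would then be cropped as gry[y0:y1, x0:x1].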
Recognized text for each row of tuples:
Row 1 | 1C | 55 | 55 | E9 | BO |
Row 2 | 1C | 1C | 55 | BO | 1C |
Row 3 | 1C | 55 | BO | 55 | IC |
Row 4 | 1C | BD | 50 | 1C | 1C |
Row 5 | 1C | 1C | 55 | BD | BD |
The idea is to take each tuple separately, upsample it, and then apply an inverse binary threshold. Tesseract misread a few tuples because of the font. For instance, the character D looks like an O, so BD in the first row was recognized as BO. If you want 100% accuracy, I suggest training Tesseract on this font. Also, make sure you try other page-segmentation modes.
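As a rough illustration of trying other page-segmentation modes, the sketch below builds Tesseract config strings for a few modes and optionally appends the whitelist from the question. The helper name build_config and the selection of modes are assumptions made for illustration; whether any of them reads this font better would have to be checked on the actual image.

```python
def build_config(psm=6, oem=3, whitelist=None):
    """Build a Tesseract config string (hypothetical helper)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        # -c sets a Tesseract config variable from the command line
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


WHITELIST = "1234567890CBDE"  # character set from the question

# Page-segmentation modes that could suit a crop holding one short tuple:
#   6 = single uniform block of text, 7 = single text line,
#   8 = single word, 10 = single character
configs = {psm: build_config(psm=psm, whitelist=WHITELIST)
           for psm in (6, 7, 8, 10)}

for psm, cfg in configs.items():
    print(psm, cfg)
```

Each string can be passed as config= to pytesseract.image_to_string, so the per-tuple results of different modes can be compared side by side.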
Here is the array output:
[['1C', '55', '55', 'E9', 'BO'], ['1C', '1C', '55', 'BO', '1C'], ['1C', '55', 'BO', '55', 'IC'], ['1C', 'BD', '50', '1C', '1C'], ['1C', '1C', '55', 'BD', 'BD']]
Code:
import cv2
import pytesseract

img = cv2.imread("IVemF.png")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]
s_idx1 = 0  # start index1 (top of the current row)
e_idx1 = int(h/5)  # end index1 (bottom of the current row)
cfg = "--psm 6"
res = []
for _ in range(0, 5):
    s_idx2 = 0  # start index2 (left edge of the current tuple)
    e_idx2 = int(w / 5)  # end index2 (right edge of the current tuple)
    row = []
    for _ in range(0, 5):
        # crop one tuple and upsample it by 2
        crp = gry[s_idx1:e_idx1, s_idx2:e_idx2]
        (h_crp, w_crp) = crp.shape[:2]
        crp = cv2.resize(crp, (w_crp*2, h_crp*2))
        # inverse binary threshold with Otsu's method
        thr = cv2.threshold(crp, 0, 255,
                            cv2.THRESH_BINARY_INV |
                            cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr,
                                          config=cfg)
        # drop the trailing newline and form feed from the output
        txt = txt.replace("\n\x0c", "")
        row.append(txt.upper())
        print(txt.upper())
        # move on to the next tuple in this row
        s_idx2 = e_idx2
        e_idx2 = s_idx2 + int(w/5)
        cv2.imshow("thr", thr)
        cv2.waitKey(0)
    res.append(row)
    # move on to the next row
    s_idx1 = e_idx1
    e_idx1 = s_idx1 + int(h/5)
print(res)
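Short of training Tesseract, a possible post-processing step is to check every recognized tuple against the allowed character set and replace known look-alikes. This is only a sketch: the names check_cell and LOOKALIKES are hypothetical, and the look-alike table is an assumption based on the misreads in the array output above (O for D, and I presumably for 1). Note that O could also stand for 0 with another font.

```python
ALLOWED = set("1234567890CBDE")  # character set from the question

# Assumed look-alike substitutions, derived only from the output above
LOOKALIKES = {"O": "D", "I": "1"}


def check_cell(text, allowed=ALLOWED, lookalikes=LOOKALIKES):
    """Return (corrected_text, is_valid) for one recognized tuple."""
    fixed = "".join(lookalikes.get(ch, ch) for ch in text)
    # A valid tuple has exactly two characters, all from the allowed set
    is_valid = len(fixed) == 2 and all(ch in allowed for ch in fixed)
    return fixed, is_valid


res = [['1C', '55', '55', 'E9', 'BO'],
       ['1C', '1C', '55', 'BO', '1C'],
       ['1C', '55', 'BO', '55', 'IC'],
       ['1C', 'BD', '50', '1C', '1C'],
       ['1C', '1C', '55', 'BD', 'BD']]

cleaned = [[check_cell(cell)[0] for cell in row] for row in res]
invalid = [(r, c) for r, row in enumerate(res)
           for c, cell in enumerate(row) if not check_cell(cell)[1]]
print(cleaned)
print(invalid)
```

A check like this only catches characters outside the allowed set. A misread that produces another allowed character passes unnoticed, so it does not replace training or a better-suited segmentation mode.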
Answered By - Ahx