Issue
I'm having a hard time extracting the text CHUBB from the image above. I have tried several image preprocessing techniques before passing the result to pytesseract, but with no success.
My Output: '\x0c'
Expected output: 'CHUBB'
Any help would be appreciated.
My attempt:
import cv2
import pytesseract

img = cv2.imread('image1_1.png')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh1 = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 199, 5)
cv2.imshow('Adaptive Mean', thresh1)
# De-allocate any associated memory usage
if cv2.waitKey(0) & 0xff == 27:
    cv2.destroyAllWindows()
# Adding custom options
custom_config = r' --psm 3'
pytesseract.image_to_string(thresh1, config=custom_config)
Solution
I think the problem is that the text CHUBB is too large relative to the picture, so there is very little margin around it. If we shrink the text a little or paste the image onto a larger canvas, pytesseract works fine:
import pytesseract
from PIL import Image

img = Image.open('test.png')  # load image; it doubles as the paste mask, so it needs transparency (e.g. RGBA)
new_img = Image.new('RGB', (400, 400), color='white')  # create a larger white canvas
new_img.paste(im=img, box=(100, 100), mask=img)  # paste the original CHUBB image onto the canvas
text = pytesseract.image_to_string(new_img, lang='eng', config='--psm 12')  # OCR
print(text)  # CHUBB
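The fixed 400x400 canvas above only works if the source image is small enough to fit. A minimal sketch of a more general version, assuming the same paste-onto-white idea: the helper name `pad_for_ocr` and its `margin` and `background` parameters are hypothetical, not part of Pillow or pytesseract.

```python
from PIL import Image


def pad_for_ocr(img, margin=100, background="white"):
    """Return an RGB copy of img with `margin` pixels of background on every side.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Convert to RGBA so the image can serve as its own paste mask,
    # whether or not the original file had an alpha channel.
    rgba = img.convert("RGBA")
    size = (rgba.width + 2 * margin, rgba.height + 2 * margin)
    canvas = Image.new("RGB", size, background)
    # Transparent pixels leave the canvas background visible.
    canvas.paste(rgba, (margin, margin), mask=rgba)
    return canvas


# Possible usage (needs pytesseract and the Tesseract binary installed):
# padded = pad_for_ocr(Image.open("test.png"))
# print(pytesseract.image_to_string(padded, lang="eng", config="--psm 12"))
```

Sizing the canvas from the image keeps the margin constant regardless of the input dimensions. The right margin for a given image is something to find by experiment.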
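FYI, this is what each page segmentation mode (--psm) returns for the padded image:
for i in range(1, 14):
    try:
        text = pytesseract.image_to_string(new_img, lang='eng', config=f"--psm {i}")  # OCR
        print('psm', i, text)
    except Exception:  # skip any mode that raises an error
        pass
Output: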
psm 1 CHUBB
psm 3 CHUBB
psm 4 CHUBB
psm 5 0
u
J
I
U
psm 6 CHUBB
psm 7 CHUBB
psm 8 7
psm 9 CHUBB
psm 10 CHUBB
psm 11 CHUBB
psm 12 CHUBB
psm 13 7
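To help read the results above, here is a small lookup of what each mode number means. The descriptions are paraphrased from Tesseract's own listing (`tesseract --help-psm`), and the wording may differ between versions. The names `PSM_MODES` and `describe_psm` are made up for this sketch.

```python
# Page segmentation modes, paraphrased from Tesseract's help listing.
PSM_MODES = {
    0: "orientation and script detection (OSD) only",
    1: "automatic page segmentation with OSD",
    2: "automatic page segmentation, no OSD or OCR (not implemented)",
    3: "fully automatic page segmentation, no OSD (default)",
    4: "single column of text of variable sizes",
    5: "single uniform block of vertically aligned text",
    6: "single uniform block of text",
    7: "single text line",
    8: "single word",
    9: "single word in a circle",
    10: "single character",
    11: "sparse text, in no particular order",
    12: "sparse text with OSD",
    13: "raw line, bypassing Tesseract-specific hacks",
}


def describe_psm(mode):
    """Return a one-line label for a --psm value (hypothetical helper)."""
    return f"--psm {mode}: {PSM_MODES.get(mode, 'unknown mode')}"


# The three modes that did not return CHUBB in the run above.
for mode in (5, 8, 13):
    print(describe_psm(mode))
```

Mode 5 expects vertically aligned text, which is consistent with the stacked single characters it returned. Mode 2 is listed as not implemented, which would explain why it is missing from the results.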
Answered By - Yu Kuo