Sunday, November 21, 2021

[FIXED] Pytesseract reading receipt

Labels: ocr, python, python-tesseract

Issue

I tried to read text from an image of a receipt using pytesseract, but the resulting text contains a lot of weird characters and looks awful. Here is the code I used to preprocess the image:

import sys
from PIL import Image
import cv2 as cv
import numpy as np
import pytesseract


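def manipulate_image(img):
    # Convert to grayscale before thresholding.
    img = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    # Note: eroding with a 1x1 kernel leaves the image unchanged.
    kernel = np.ones((1, 1), dtype="uint8")
    img = cv.erode(img, kernel, iterations=1)
    # Binarize using Otsu's automatically chosen threshold.
    img = cv.threshold(img, 0, 255,
        cv.THRESH_BINARY | cv.THRESH_OTSU)[1]
    # Smooth out salt-and-pepper noise.
    img = cv.medianBlur(img, 3)
    return img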


if len(sys.argv) > 2:
    print("Please provide only name of image.")
elif len(sys.argv) == 2:
    img = cv.imread(sys.argv[1])

    img = manipulate_image(img)
    cv.imwrite("test.png", img)
    text = pytesseract.image_to_string(img)
    print(text)
else:
    print("Please provide name of image.")

Here is my test receipt image: https://imgur.com/a/RjeQ9dL, here is the output image after preprocessing: https://imgur.com/a/1tFZRdq, and here is the resulting text:

""'9vco4v‘l7

0 .Vt3t00N 00t300N BUNUUS



SKLEP PUU POPUGOH|
UL. JHGIELLUNSKA 25, 70-364 SZCZ[C|N
TEL. 91 4841-20-58
N|P: 955—150-21-B2
dn.19r03.05 Uydr.8534
PARAGON FISKALNY
CIHSTKH 17 0,3 ¥ 16,30 = 4.89 B
Sp.0p.B 4,89 PTU B= 8,00% 0,35
Razem PTU 0,35
ZOP{HCUNU GUTUNKQ PLN
RESZTA PLN
0025/1373 H0103 0N|0 H.
15F H9HF[B9416} 13ﬂ02D6k0[20D4334C
7?? BW 140

Any idea how to do this in a better way to get nicer results?

Solution

Simple thresholding alone is not enough for Tesseract to detect the characters reliably. Much more preprocessing can be done to drastically improve your results, such as:

using Tesseract 4, which adds an LSTM-based (deep learning) recognition engine
segmenting characters
cropping to the part of the receipt that contains the text, found through edge detection
applying a perspective transform to straighten out the text
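The edge detection and perspective transform steps can be sketched roughly as follows. This is a minimal illustration and not the answerer's own code: the function names, the Canny thresholds (75, 200), the blur size and the 0.02 approximation factor are all assumptions chosen for the example, and it presumes the receipt outline is one of the largest four-sided contours in the photo. OpenCV and NumPy are imported inside the functions that need them so that the pure geometry helpers can be used on their own.

```python
import math


def order_corners(points):
    """Return four (x, y) points as top-left, top-right, bottom-right, bottom-left."""
    # In image coordinates the top-left corner has the smallest x + y
    # and the bottom-right corner has the largest.
    by_sum = sorted(points, key=lambda p: p[0] + p[1])
    top_left, bottom_right = by_sum[0], by_sum[-1]
    # Of the two remaining corners, top-right has the larger x - y.
    bottom_left, top_right = sorted(by_sum[1:3], key=lambda p: p[0] - p[1])
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(corners):
    """Width and height of the straightened image, from the longest opposite edges."""
    tl, tr, br, bl = corners
    width = int(round(max(math.dist(tl, tr), math.dist(bl, br))))
    height = int(round(max(math.dist(tl, bl), math.dist(tr, br))))
    return width, height


def find_receipt_corners(img):
    """Look for a large four-sided contour; returns None if nothing suitable is found."""
    import cv2 as cv

    gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
    gray = cv.GaussianBlur(gray, (5, 5), 0)
    edges = cv.Canny(gray, 75, 200)  # thresholds are illustrative guesses
    # [-2] picks the contour list under both OpenCV 3 and OpenCV 4.
    contours = cv.findContours(edges, cv.RETR_LIST, cv.CHAIN_APPROX_SIMPLE)[-2]
    for contour in sorted(contours, key=cv.contourArea, reverse=True)[:5]:
        perimeter = cv.arcLength(contour, True)
        approx = cv.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) == 4:
            return [tuple(p) for p in approx.reshape(4, 2).tolist()]
    return None


def straighten(img, points):
    """Warp the quadrilateral given by `points` onto an upright rectangle."""
    import cv2 as cv
    import numpy as np

    corners = order_corners(points)
    width, height = target_size(corners)
    src = np.array(corners, dtype="float32")
    dst = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype="float32",
    )
    matrix = cv.getPerspectiveTransform(src, dst)
    return cv.warpPerspective(img, matrix, (width, height))
```

A possible use would be to call find_receipt_corners on the image loaded with cv.imread, pass the result to straighten if it is not None, and only then apply thresholding and OCR. Whether this helps depends on how clearly the receipt stands out from the background.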

These topics are too lengthy to cover in one answer, but some PyImageSearch articles discuss them in much more depth:

https://www.pyimagesearch.com/2014/09/01/build-kick-ass-mobile-document-scanner-just-5-minutes/
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/
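As one illustration of the Tesseract 4 suggestion, the LSTM engine and a page segmentation mode can be selected through the config string that pytesseract passes to the tesseract command. The sketch below is an assumption-laden example, not part of the original answer: the helper names are made up, psm 6 (treat the image as a single uniform block of text) is only a plausible starting point for a receipt, and lang="pol" assumes the receipt is in Polish and that the Polish language data is installed.

```python
def build_tesseract_config(oem=1, psm=6):
    """Build a config string: --oem 1 selects the LSTM engine, --psm the layout mode."""
    return "--oem {} --psm {}".format(oem, psm)


def read_receipt(img, lang="pol"):
    """Run OCR on a preprocessed image (hypothetical helper for illustration)."""
    # Imported here so the config helper can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(
        img, lang=lang, config=build_tesseract_config()
    )
```

It may be worth trying a few psm values (for example 4 or 6) on the same image and comparing the output, since the best mode depends on the receipt layout.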

Answered By - Sergey Ronin

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
