Issue
I am working on a contract sheet with OpenCV and pytesseract, and I want to extract the words from the image.
I am trying cv2.getStructuringElement, but my code jumps to the next line in the center of the image. I want to extract the words from the left side of the image first and, once all the text on the left has been extracted, move on to the right side of the image.
The code is:
import cv2
import pytesseract
from PIL import Image

image = cv2.imread("report_name-1.jpg")

# preprocessing
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale
_, thresh = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)  # threshold
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
dilated = cv2.dilate(thresh, kernel, iterations=13)  # dilate
contours, hierarchy = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)  # get contours

# for each contour found, draw a rectangle around it on the original image
for contour in contours:
    # get rectangle bounding contour
    x, y, w, h = cv2.boundingRect(contour)
    # discard areas that are too large
    if h > 300 and w > 300:
        continue
    # discard areas that are too small
    if h < 40 or w < 40:
        continue
    # draw rectangle around contour on original image
    cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 255), 2)
Solution
You can extract the text left-to-right and top-to-bottom using --psm 6, which tells Pytesseract to assume a single uniform block of text. Preprocessing is also important, so we apply Otsu's threshold to obtain a binary image with the desired foreground text in black and the background in white. See the Tesseract documentation for other Pytesseract configuration options. After thresholding, the binary image is passed to Pytesseract.
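The effect of the page segmentation mode is easiest to judge by running the same image through a few modes. Below is a minimal sketch, assuming pytesseract and Tesseract are installed. `psm_config` and `compare_psm` are hypothetical helper names, not part of pytesseract. Modes 3 (fully automatic, the default), 4 (single column of variable-sized text) and 6 (single uniform block of text) are Tesseract's documented values.

```python
def psm_config(psm):
    """Build the config string for a Tesseract page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def compare_psm(thresh, modes=(3, 4, 6)):
    """Run OCR once per mode and return {mode: text}.

    `thresh` is the binarized image (e.g. the Otsu result from cv2.threshold).
    pytesseract is imported here so the helper above works without it.
    """
    import pytesseract

    return {
        psm: pytesseract.image_to_string(thresh, lang="eng", config=psm_config(psm))
        for psm in modes
    }
```

`compare_psm(thresh)[6]` should match what `image_to_string` returns with `config='--psm 6'`. The other modes may order the two sides of this layout differently, so it is worth checking them on your own scan.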
Here's the output with --psm 6:
Limit Balance
Sep 29, 2015 $17,750.0 Oct 01, 2018 $0.00 Oct 02, 2018
0
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 4636676005495602 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2016 0 0 0
2017 0 0 0 0 0 0 0 0 0 0 0 0
2018 0 0 0 0 0 0 0 0 0 B
> BMW FINANCIAL SERVICES /
2602980
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2015 $27,189.00 Jul01, 2017 $0.00 Jul 21, 2017 Jul 24, 2017
Account Condition: Paid account/zero Account #: 4002206279 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Lease Account Term: 036
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2015 Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2017 Cc Cc Cc Cc Cc Cc B
> LEXUS FINANCIAL SERVIC /
1624210
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Mar 07, 2015 $40,342.00 Jul01, 2016 $0.00 Jul 05, 2016 Jul 31, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 70403662535410001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Auto Loan Account Term: 072
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc Cc Cc Cc B
> AES/SUNTRUST BANK / 9997195
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Sep 19, 2008 $12,500.00 Apr 01, 2016 $0.00 Apr 21, 2016 Apr 30, 2016
Account Condition: Paid account/zero Account #: Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance 5046237209PA00001 Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Signer
standing
Account Type: Education Loan Account Term: 300
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc Cc
2015 Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc Cc
2016 Cc Cc Cc B
> BARCLAYS BANK DELAWARE /
1223850
Open Date Original Amount Credit Status Date Chargeoff Amount Past Due Last Paid Date Balance Date Current
Limit Balance
Apr 04, 2013 $3,500.00 Apr 01, 2016 $0.00 Oct 06, 2014 Apr 05, 2016
Account Condition: Paid account/zero Account #: 000176863399109 Delinquency 30 Days = $0.00 | 60 Days =$0.00 90+ Days =$0.00 | Derog =00
balance Counter (Past
seven years)
Payment Status: This is an account in good Responsibility: Individual
standing
Account Type: Credit Card Account Term: REV
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2014 Cc Cc Cc Cc Cc Cc Cc Cc 0
2015 0 0 0 0 0 0 0 0 0 0 0 0
2016 0 0 0 B
> AMERICAN HONDA FINANCE /
1605190
Code:
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
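If you need the order described in the question (everything on the left side first, then the right side) rather than the row-by-row order that --psm 6 produces, one option is to ask Tesseract for word positions and regroup them. The sketch below assumes the page splits into two sides at a known x coordinate. `split_x` is a hypothetical parameter (here half the image width) that you would tune for your scan, and `words_by_column` and `ocr_two_columns` are illustrative names. It relies on `pytesseract.image_to_data` with `Output.DICT`, which returns parallel lists such as `text` and `left`.

```python
def words_by_column(data, split_x):
    """Partition OCR words into (left, right) lists, keeping reading order.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    Words arrive row by row, so a stable partition by x position leaves
    each side in top-to-bottom, left-to-right order.
    """
    left, right = [], []
    for text, x in zip(data["text"], data["left"]):
        if not text.strip():
            continue  # skip the empty entries for block/line level rows
        (left if x < split_x else right).append(text)
    return left, right


def ocr_two_columns(path):
    """OCR an image and return the left-side text followed by the right-side text."""
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    data = pytesseract.image_to_data(
        thresh, lang="eng", config="--psm 6", output_type=Output.DICT
    )
    split_x = image.shape[1] // 2  # assumption: the two sides meet at mid-width
    left, right = words_by_column(data, split_x)
    return " ".join(left) + "\n" + " ".join(right)
```

Calling `print(ocr_two_columns('1.jpg'))` would print the left-side words on one line and the right-side words on the next. Line breaks within each side are dropped in this sketch; the `line_num` list in the same dict could be used to restore them.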
Answered By - nathancy