Issue
I have about 12,000 image links in my SQL table. The goal is to detect which of those images contain watermark text and which don't. All of the text and borders look like this.
I've tried it with OpenCV and tesserocr:
import requests, tesserocr
from io import BytesIO
from PIL import Image
img_data = requests.get(url).content
img = Image.open(BytesIO(img_data))
print(url, tesserocr.image_to_text(img).strip())
But it doesn't seem to recognize the text in the image at all.
My second approach was to use an external OCR API (OCR.space).
for url, id in alldata:
    body = {'language': 'eng',
            'isOverlayRequired': 'false',
            'url': url,
            'iscreatesearchablepdf': 'false',
            'issearchablepdfhidetextlayer': 'false',
            'filetype': 'jpeg'}
    headers = {'apikey': 'xxxxxxx'}
    rsp = requests.post('https://api.ocr.space/parse/image', body, headers=headers)
It works, but it's very slow; for 11,000 images it would take a few days.
Solution
tesserocr isn't detecting any text because of the small text height/size. By cropping out the text region first and running OCR on just that crop, pytesseract can extract the text.
Using contours and dilation to detect the text area didn't work either, again because of the small text size. To detect the text region, I used the EAST model to extract all the text regions, following this solution, and combined them into a single region. Passing the cropped combined region to Tesseract returns the text. To run this script, you need to download the model, which can be found here, and install the required dependencies.
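Before the full script, here is a minimal sketch of the crop-then-OCR step in isolation. The crop coordinates below are made-up placeholders; the full script finds them automatically with the EAST detector.

import cv2
import pytesseract

img = cv2.imread("r9Do4.png", cv2.IMREAD_COLOR)
x1, y1, x2, y2 = 0, 0, 300, 40          # hypothetical text-region box, not real coordinates
crop = img[y1:y2, x1:x2]                # keep only the watermark text region
gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)   # cv2 loads images in BGR order
print(pytesseract.image_to_string(gray, config='--psm 11').strip())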
Python Script:
import numpy as np
import cv2
from imutils.object_detection import non_max_suppression
import matplotlib.pyplot as plt
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe" # I am using windows
image1 = cv2.imread("r9Do4.png",cv2.IMREAD_COLOR)
ima_org = image1.copy()
(height1, width1) = image1.shape[:2]
size = 640 # size must be a multiple of 32. I haven't tested smaller sizes, which could increase speed but might decrease accuracy.
(height2, width2) = (size, size)
image2 = cv2.resize(image1, (width2, height2))
net = cv2.dnn.readNet("frozen_east_text_detection.pb")
blob = cv2.dnn.blobFromImage(image2, 1.0, (width2, height2), (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
(scores, geometry) = net.forward(["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"])
(rows, cols) = scores.shape[2:4] # grab the rows and columns from score volume
rects = [] # stores the bounding box coordinates for text regions
confidences = [] # stores the probability associated with each bounding box region in rects
for y in range(rows):
    # extract the scores and geometry data for row y
    scoresdata = scores[0, 0, y]
    xdata0 = geometry[0, 0, y]
    xdata1 = geometry[0, 1, y]
    xdata2 = geometry[0, 2, y]
    xdata3 = geometry[0, 3, y]
    angles = geometry[0, 4, y]
    for x in range(cols):
        if scoresdata[x] < 0.5:  # if the score is less than min_confidence, ignore it
            continue
        # the EAST detector reduces the volume size by a factor of 4, so scale the offsets back up
        offsetx = x * 4.0
        offsety = y * 4.0
        # extract the rotation angle for the prediction and compute its sine and cosine
        angle = angles[x]
        cos = np.cos(angle)
        sin = np.sin(angle)
        # derive the height and width of the bounding box from the geometry volume
        h = xdata0[x] + xdata2[x]
        w = xdata1[x] + xdata3[x]
        # compute the end (bottom-right) and start (top-left) coordinates of the box
        endx = int(offsetx + (cos * xdata1[x]) + (sin * xdata2[x]))
        endy = int(offsety - (sin * xdata1[x]) + (cos * xdata2[x]))
        startx = int(endx - w)
        starty = int(endy - h)
        # append the bounding box coordinates and confidence score to the lists
        rects.append((startx, starty, endx, endy))
        confidences.append(scoresdata[x])
# applying non-maxima suppression to suppress weak and overlapping bounding boxes
boxes = non_max_suppression(np.array(rects), probs=confidences)
# ratios to scale the boxes back to the original image size
rW = width1 / float(width2)
rH = height1 / float(height2)
bb = []  # scaled bounding boxes on the original image
for (startx, starty, endx, endy) in boxes:
    # scale the bounding box coordinates back to the original image size
    startx = int(startx * rW)
    starty = int(starty * rH)
    endx = int(endx * rW)
    endy = int(endy * rH)
    cv2.rectangle(image1, (startx, starty), (endx, endy), (255, 0, 0), 2)
    bb.append([startx, starty, endx, endy])
# combining the bounding boxes to get the full text region (union of all boxes)
csx = min(box[0] for box in bb)
csy = min(box[1] for box in bb)
cex = max(box[2] for box in bb)
cey = max(box[3] for box in bb) + 5  # small padding below the text
cv2.imshow("BB img",image1)
cv2.waitKey(0)
it=ima_org[csy:cey, csx:cex]
cv2.imshow("Cropped Text Region",it)
cv2.waitKey(0)
# cv2.imread loads images in BGR order, so convert from BGR (not RGB) before thresholding
thr = cv2.threshold(src=cv2.cvtColor(it, cv2.COLOR_BGR2GRAY), thresh=0, maxval=255, type=cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
txt = pytesseract.image_to_string(thr,lang='eng',config='--psm 11')
print(txt.strip())
Here's the output of the script:
[Image: bounding boxes around the detected text regions]
[Image: cropped text region]
[Extracted text output]
As you mentioned in the post, all of your images are similar, so you can extract the text from them by repurposing this script. The script is fairly fast: it takes about 2.1 seconds for this image.
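For the original batch problem, here is a rough sketch of how the script could be repurposed to flag which of the linked images carry watermark text. It assumes the EAST + pytesseract steps above have been wrapped into a hypothetical helper has_watermark_text(img) that returns the extracted text (empty when nothing is found), and that alldata is the (url, id) list from the question.

import numpy as np
import requests
import cv2

def classify_images(alldata):
    # yields (id, url, watermarked) for every (url, id) pair
    for url, id in alldata:
        img_data = requests.get(url).content                    # download the image
        img = cv2.imdecode(np.frombuffer(img_data, np.uint8),   # decode the bytes into a BGR array
                           cv2.IMREAD_COLOR)
        text = has_watermark_text(img)                          # EAST + tesseract, as in the script above
        yield id, url, bool(text.strip())

At roughly 2 seconds per image this still works out to several hours for 12,000 images, so it may be worth splitting the list across a few processes.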
Answered By - Adnan taufique