Issue
I am trying to detect and grab text from a screenshot taken from any consumer product's ad.
My code works with reasonable accuracy but fails to draw bounding boxes around skewed text areas.
Recently I tried the Google Vision API, and it draws bounding boxes around almost every possible text area and detects the text in those areas with great accuracy. I am curious how I can achieve the same or similar results!
My test image:
Google Vision API output with bounding boxes:
Thank you in advance:)
Solution
There are a few open source vision packages that can detect text against noisy backgrounds, with accuracy comparable to Google's Vision API.
You can use EAST (Efficient and Accurate Scene Text Detector), a simple architecture built on a fully convolutional network, by Zhou et al.: https://arxiv.org/abs/1704.03155v2
Using Python:
Download the pre-trained model from https://www.dropbox.com/s/r2ingd0l3zt8hxs/frozen_east_text_detection.tar.gz?dl=1 and extract it to your current folder.
You will need OpenCV >= 3.4.2 to run the code below.
import cv2
import math
net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # the model we get after extraction
frame = cv2.imread("<image_filename>")  # replace with the path to your image
inpWidth = inpHeight = 320  # a default dimension (must be a multiple of 32)
# Preparing a blob to pass the image through the neural network
# Subtracting mean values used while training the model.
image_blob = cv2.dnn.blobFromImage(frame, 1.0, (inpWidth, inpHeight), (123.68, 116.78, 103.94), True, False)
Now we have to define the output layers, which emit the positional values of the detected text (geometry) and its confidence score (through a sigmoid activation):
output_layer = []
output_layer.append("feature_fusion/Conv_7/Sigmoid")
output_layer.append("feature_fusion/concat_3")
Finally, we do a forward pass through the network to get the desired outputs.
net.setInput(image_blob)
output = net.forward(output_layer)
scores = output[0]
geometry = output[1]
Here I have used the decode function defined in OpenCV's sample text_detection.py on GitHub, https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.py (lines 23 to 75), to convert the positional values into rotated box coordinates; a condensed version is shown below.
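For completeness, here is a condensed version of that decode function, adapted from the OpenCV sample linked above (if anything looks off, compare against the original file):
def decode(scores, geometry, scoreThresh):
    # Adapted from OpenCV's samples/dnn/text_detection.py
    detections = []   # rotated rects: ((cx, cy), (w, h), angle_in_degrees)
    confidences = []
    height = scores.shape[2]
    width = scores.shape[3]
    for y in range(height):
        scoresData = scores[0][0][y]
        x0 = geometry[0][0][y]  # distances from each cell to the box edges
        x1 = geometry[0][1][y]
        x2 = geometry[0][2][y]
        x3 = geometry[0][3][y]
        angles = geometry[0][4][y]
        for x in range(width):
            score = scoresData[x]
            if score < scoreThresh:
                continue
            # each output cell corresponds to a 4x4 region of the input image
            offsetX, offsetY = x * 4.0, y * 4.0
            angle = angles[x]
            cosA, sinA = math.cos(angle), math.sin(angle)
            h = x0[x] + x2[x]
            w = x1[x] + x3[x]
            offset = (offsetX + cosA * x1[x] + sinA * x2[x],
                      offsetY - sinA * x1[x] + cosA * x2[x])
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0], sinA * w + offset[1])
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            detections.append((center, (w, h), -angle * 180.0 / math.pi))
            confidences.append(float(score))
    return [detections, confidences]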
For the box detection threshold I have used a value of 0.5, and for non-maximum suppression 0.3. You can try different values to achieve better bounding boxes.
confThreshold = 0.5
nmsThreshold = 0.3
[boxes, confidences] = decode(scores, geometry, confThreshold)
indices = cv2.dnn.NMSBoxesRotated(boxes, confidences, confThreshold, nmsThreshold)
Lastly, to overlay the boxes on the detected text in the image:
height_ = frame.shape[0]
width_ = frame.shape[1]
rW = width_ / float(inpWidth)
rH = height_ / float(inpHeight)
for i in indices:
    # NMSBoxesRotated may return Nx1 indices (OpenCV 3.4) or flat ints (newer versions)
    idx = i[0] if hasattr(i, "__len__") else i
    # get the 4 corners of the rotated rect
    vertices = cv2.boxPoints(boxes[idx])
    # scale the bounding box coordinates back to the original image size
    for j in range(4):
        vertices[j][0] *= rW
        vertices[j][1] *= rH
    # draw the 4 edges of the box (cv2.line expects integer coordinates)
    for j in range(4):
        p1 = (int(vertices[j][0]), int(vertices[j][1]))
        p2 = (int(vertices[(j + 1) % 4][0]), int(vertices[(j + 1) % 4][1]))
        cv2.line(frame, p1, p2, (0, 255, 0), 3)
# To save the image:
cv2.imwrite("maggi_boxed.jpg", frame)
I have not experimented with different threshold values. Tuning them should give better results and also remove the misclassifications of the logo as text; a simple starting point is shown below.
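As a rough first pass (the specific values here are only illustrative, not recommendations), you can sweep both thresholds and compare how many boxes survive NMS for each combination:
# Illustrative sweep over detection and NMS thresholds
for conf_t in (0.3, 0.5, 0.7):
    for nms_t in (0.2, 0.3, 0.4):
        cand_boxes, cand_confs = decode(scores, geometry, conf_t)
        idx = cv2.dnn.NMSBoxesRotated(cand_boxes, cand_confs, conf_t, nms_t)
        print("confThreshold=%.1f nmsThreshold=%.1f -> %d boxes" % (conf_t, nms_t, len(idx)))
Inspecting the drawn boxes for a few of these combinations is usually quicker than reasoning about the thresholds in the abstract.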
Note: The model was trained on an English corpus, so Hindi words will not be detected. You can also read the paper, which outlines the test datasets it was benchmarked on.
Answered By - Fleron-X