Sunday, June 19, 2022

[FIXED] How to OCR a text with white colour characters on a blue background from a cropped image?

June 19, 2022 ocr, opencv, python, python-tesseract No comments

Issue

First, I want to crop an image using a mouse event, and then print the text inside the cropped image. I tried OCR scripts but all can't work for this image attached below. I think the reason is that the text has white characters on blue background.

Can you help me with doing this?

Full image:

Cropped image:

An example what I tried is:

import pytesseract
import cv2
import numpy as np

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

img = cv2.imread('D:/frame/time 0_03_.jpg')

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
adaptiveThresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 35, 30)
inverted_bin=cv2.bitwise_not(adaptiveThresh)

#Some noise reduction
kernel = np.ones((2,2),np.uint8)
processed_img = cv2.erode(inverted_bin, kernel, iterations = 1)
processed_img = cv2.dilate(processed_img, kernel, iterations = 1)
 
#Applying image_to_string method
text = pytesseract.image_to_string(processed_img)
 
print(text)

Solution

[EDIT]

For anyone wondering, the image in the question was updated after posting my answer. That was the original image:

Thus, the below output in my original answer.

That's the newly posted image:

The specific Turkish characters, especially in the last word, are still not properly detected (since I still can't use lang='tur' right now), but at least the Ö and Ü can be detected using lang='deu', which I have installed:

text = pytesseract.image_to_string(mask, lang='deu').strip().replace('\n', '').replace('\f', '')
print(text)
# GÖKYÜZÜ AVCILARI ILE TEKE TEK KLASIGI

[/EDIT]

I wouldn't use cv2.adaptiveThreshold here, but simple cv2.threshold using cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV. Since, the comma touches the image border, I'd add another, one pixel wide border via cv2.copyMakeBorder to capture the comma properly. So, that would be the full code (replacing \f is due to my pytesseract version only):

import cv2
import pytesseract

img = cv2.imread('n7nET.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
mask = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY_INV)[1]
mask = cv2.copyMakeBorder(mask, 1, 1, 1, 1, cv2.BORDER_CONSTANT, 0)
text = pytesseract.image_to_string(mask).strip().replace('\n', '').replace('\f', '')
print(text)
# 2020'DE SALGINI BILDILER, YA 2021'DE?

The output seems correct to me – of course, not for this special (I assume Turkish) capital I character with the dot above. Unfortunately, I can't run pytesseract.image_to_string(..., lang='tur'), since it's simply not installed. Maybe, have a look at that to get the proper characters here as well.

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.16299-SP0
Python:        3.9.1
PyCharm:       2021.1.1
OpenCV:        4.5.1
pytesseract:   5.0.0-alpha.20201127
----------------------------------------

Answered By - HansHirse

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Sunday, June 19, 2022

[FIXED] How to OCR a text with white colour characters on a blue background from a cropped image?

Issue

Solution

0 comments:

Post a Comment

Popular Posts

Labels