Issue
I am trying to detect Bangla characters in an image using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
im = Image.open("input.png") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save('temp2.png')
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('temp2.png'),lang="ben")
print text
The problem is that if I give it an image of English characters, it detects them. But when I set lang="ben" and try to detect Bengali characters in an image, my code runs endlessly, seemingly forever.
P.S.: I have downloaded the Bengali language trained data to the tessdata folder, and I am trying to run this in PyCharm.
Can anyone help me solve this problem?
Solution
I added the Bangla (India) language to Windows and downloaded ben.traineddata to TESSDATA_PREFIX, which on my PC is C:\Program Files\Tesseract 4.0.0\tessdata. Then I ran
> tesseract -l ben bangla.jpg bangla_out
in the command prompt and got the result below in 2 seconds. The result looks fine, even though I don't understand the language.
Have you tried running tesseract in the command prompt to verify that it works with -l ben?
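You can also do the same check from Python. This is a minimal sketch, assuming the tesseract.exe path below matches your install; it asks Tesseract which languages it can see so you can confirm that ben is listed:

import subprocess

# Path assumed from the install used above; adjust if yours differs.
tesseract_exe = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"

# --list-langs prints every .traineddata file Tesseract finds in tessdata;
# "ben" must appear in this list, otherwise ben.traineddata is not being picked up.
# (older Tesseract versions print the list to stderr, hence the redirect)
langs = subprocess.check_output([tesseract_exe, "--list-langs"], stderr=subprocess.STDOUT)
print(langs)

If ben does not show up, Tesseract is not reading the traineddata file from the tessdata folder you expect.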
EDIT:
I used Spyder (similar to PyCharm; it comes with Anaconda) to test it, and modified your code to call Tesseract as below.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
Test Code in Spyder:
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
import os

im = Image.open("bangla.jpg") # the second one
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
im.save("bangla_pp.jpg")
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"
text = pytesseract.image_to_string(Image.open("bangla_pp.jpg"), lang="ben")
print text
It works and produced the result below on the processed image. Apparently, the OCR result for the processed image is not as good as the one for the original image.
Result from the processed bangla_pp.jpg:
প্রত্যাবর্তনকারীরা তাঁদের দেশে গিয়ে -~~-<~~~~-- প্রত্যাবর্তন-পরবর্তী আর্থিক সহায়তা = পাবেন তার
Result from the original image, fed directly to Tesseract.
Code:
from PIL import Image
import pytesseract as tess

print tess.image_to_string(Image.open('bangla.jpg'), lang='ben')
Output:
প্রত্যাবর্তনকারীরা তাঁদের দেশে গিয়ে প্রত্যাবর্তন-পরবর্তী আর্থিক সহায়তা পাবেন তার
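If the Python call still appears to hang on some images, later pytesseract releases accept a timeout argument, so a stuck Tesseract process is killed instead of running forever. This is a minimal sketch, assuming your pytesseract version supports timeout and reusing the paths above:

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract 4.0.0/tesseract.exe"

try:
    # timeout (in seconds) makes pytesseract terminate the Tesseract process
    # and raise RuntimeError instead of waiting forever on a problematic image
    text = pytesseract.image_to_string(Image.open('bangla.jpg'), lang='ben', timeout=30)
    print(text)
except RuntimeError:
    print('Tesseract timed out on this image')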
Answered By - thewaywewere