Issue
Given an input image whose text can be in any language or writing system, how do I detect which script the text in the picture uses?
Any Python-based or Tesseract-OCR based solution would be appreciated.
Note that "script" here means a writing system such as Latin, Cyrillic, or Devanagari, as used by languages such as English, Russian, and Hindi, respectively.
Solution
Pre-requisites:
- Install Tesseract:
sudo apt install tesseract-ocr tesseract-ocr-all
- Install pytesseract:
pip install pytesseract
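Script detection relies on Tesseract's orientation and script detection (OSD) data, which is listed as `osd` by `tesseract --list-langs`. The following is a minimal sketch for checking that it is installed; the helper names `parse_list_langs` and `installed_tesseract_langs` are made up for this example, and it assumes the first line of the command's output is a header, as in recent Tesseract releases.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract pack names from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (3):", so it is skipped.
    return lines[1:]


def installed_tesseract_langs():
    """Return installed pack names, or an empty list if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # older releases print the list to stderr
        text=True,
        check=False,
    )
    return parse_list_langs(result.stdout)


if __name__ == "__main__":
    langs = installed_tesseract_langs()
    print("osd data installed:", "osd" in langs)
```

If `osd` is missing, installing `tesseract-ocr-all` as shown above should provide it.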
Script-Detection:
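import re

import pytesseract


def detect_image_lang(img_path):
    """Return (script_name, confidence), or (None, 0.0) if detection fails."""
    try:
        # OSD = orientation and script detection; the result is a plain-text report.
        osd = pytesseract.image_to_osd(img_path)
        script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
        conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
        return script, float(conf)
    except Exception:
        # Covers Tesseract errors as well as a report that does not match the patterns.
        return None, 0.0


script_name, confidence = detect_image_lang("image.png")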
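As an alternative to the two regular expressions, the OSD report can be split into key/value pairs, which also makes it easy to reject low-confidence results. This is only a sketch: `SAMPLE_OSD` is an illustrative report with made-up numbers, and `parse_osd`, `script_from_osd`, and the `min_confidence` threshold are hypothetical names and values, not part of pytesseract.

```python
# Illustrative sample of the text returned by pytesseract.image_to_osd();
# the numbers are invented for this example.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.50
Script: Cyrillic
Script confidence: 4.17
"""


def parse_osd(osd_text):
    """Turn the 'Key: value' lines of an OSD report into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def script_from_osd(osd_text, min_confidence=1.0):
    """Return (script, confidence); script is None below the threshold."""
    fields = parse_osd(osd_text)
    script = fields.get("Script")
    try:
        confidence = float(fields.get("Script confidence", "0"))
    except ValueError:
        confidence = 0.0
    if script is None or confidence < min_confidence:
        return None, confidence
    return script, confidence


print(script_from_osd(SAMPLE_OSD))  # ('Cyrillic', 4.17)
```

A suitable threshold depends on the images involved, so the default of 1.0 here is a placeholder to tune rather than a recommended value.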
Language-Detection:
After performing OCR with Tesseract, pass the recognized text through the langdetect
library (or any other language-detection library).
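A rough sketch of that pipeline, assuming `pip install langdetect` has been run. The function names, the `min_chars` cutoff, and the default `ocr_lang="eng"` are assumptions made for this example; in practice the OCR language pack would be chosen to match the detected script.

```python
def detect_text_language(text, min_chars=20):
    """Return a langdetect language code such as 'en', or None if undetectable."""
    text = text.strip()
    # Very short strings tend to give unreliable guesses, so skip them.
    if len(text) < min_chars:
        return None

    # Imported here so the function can be defined without langdetect installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make results repeatable between runs
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits).
        return None


def detect_image_language(img_path, ocr_lang="eng"):
    """OCR the image with Tesseract, then guess the language of the text."""
    import pytesseract

    text = pytesseract.image_to_string(img_path, lang=ocr_lang)
    return detect_text_language(text)
```

For example, `detect_image_language("image.png", ocr_lang="rus")` would OCR the image with the Russian pack and then run langdetect on the recognized text.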
Answered By - Gokul NC