Wednesday, June 8, 2022

[FIXED] Turning off English dictionary words for pytesseract (for an ALPR system)

June 08, 2022 image-processing, python, python-tesseract, tesseract

Issue

I am using pytesseract to convert an image of a number plate to text, for something like this:

number plate

import sys

import pyocr

tools = pyocr.get_available_tools()
if len(tools) == 0:
    print("No OCR tool found")
    sys.exit(1)

# The tools are returned in the recommended order of usage
tool = tools[0]
print("Will use tool '%s'" % (tool.get_name()))
# Ex: Will use tool 'libtesseract'

# Pick the first language reported by the selected tool
langs = tool.get_available_languages()
print("Available languages: %s" % ", ".join(langs))
lang = langs[0]
print("Will use lang '%s'" % (lang))

This is how I read it. I whitelist all the characters that could appear on the plate:

text = pytesseract.image_to_string(Image.open('images/text.jpg'), config="-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")

Right now pytesseract reads the plate as if it were looking for dictionary words, which gives poor results. There is a way to turn off dictionary words, but I cannot figure out how to do it in Python. That is my question. Thanks.

Solution

Add a config file that disables the system and frequent-word DAWGs (the dictionary word lists Tesseract loads):

load_system_dawg     F
load_freq_dawg       F

Config files should be placed in the tessdata/configs directory (e.g. tessdata/configs/config) and passed to Tesseract during the Init procedure.
I am not 100% sure how this is done with pytesseract, but I believe you can work it out from here.

The Init() function signature looks something like this:

Init(
    const char *                    datapath,
    const char *                    language,
    OcrEngineMode                   oem,
    char **                         configs,
    int                             configs_size,
    const GenericVector< STRING > * vars_vec,
    const GenericVector< STRING > * vars_values,
    bool                            set_only_non_debug_params
)

So you need to set configs to a pointer to a pointer to the string "config", and configs_size to 1.

It would probably look something like the following; you will need to adapt it to get it working:

api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_TESSERACT_ONLY, POINTER(ctypes.c_char_p("config")), 1, None, None, False)
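With pytesseract specifically, which calls the tesseract command-line binary rather than the C++ API, a config file may not be needed. The sketch below assumes the installed Tesseract accepts the two DAWG parameters through -c options on the command line, and it assumes the plate is a single line of text (--psm 7). The names build_plate_config and read_plate are hypothetical helpers, not part of pytesseract.

```python
import shlex

# Characters that can appear on the plate (same whitelist as in the question)
PLATE_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_plate_config(whitelist=PLATE_CHARS, psm=7):
    """Build a config string for pytesseract that disables the dictionaries.

    Assumption: the tesseract binary honours load_system_dawg and
    load_freq_dawg when they are given as -c options.
    """
    options = [
        f"--psm {psm}",                 # assumed: plate is one line of text
        "-c load_system_dawg=false",    # do not load the main word list
        "-c load_freq_dawg=false",      # do not load the frequent-word list
        f"-c tessedit_char_whitelist={whitelist}",
    ]
    return " ".join(options)


def read_plate(image_path):
    """Run OCR on one image with the dictionary-free config."""
    # Imported here so the config helper works without the OCR dependencies
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(
        Image.open(image_path), config=build_plate_config()
    )
    return text.strip()


# pytesseract splits the config string into separate command-line arguments
print(shlex.split(build_plate_config()))
```

Calling read_plate('images/text.jpg') would then replace the image_to_string call from the question. Whether the whitelist is respected can depend on the Tesseract version and engine mode, so it is worth checking the output on a few known plates.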

EDIT:
Also note that disabling the DAWGs might not solve your issue. If I were you, I would simply iterate over the alternatives in the results and take the letter with the highest confidence (when the DAWG search is on, the default letters are not always the ones with the highest confidence), and work more on improving input image quality as described here.
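pytesseract does not expose per-letter alternatives directly, so the following is only a coarser, word-level approximation of that idea. It assumes the input is the dictionary returned by pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT), which has parallel 'text' and 'conf' lists. The helper name best_candidate is hypothetical.

```python
def best_candidate(data, min_conf=0.0):
    """Return (text, confidence) for the most confident non-empty word.

    `data` is expected to look like the dict produced by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    Returns (None, min_conf) if no word reaches min_conf.
    """
    best_text, best_conf = None, min_conf
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)      # conf may arrive as str, int or float
        text = text.strip()
        # Layout-only rows have empty text (and conf -1), so skip them
        if text and conf >= best_conf:
            best_text, best_conf = text, conf
    return best_text, best_conf


# Usage with real OCR output (requires pytesseract and Pillow):
#   data = pytesseract.image_to_data(Image.open('images/text.jpg'),
#                                    output_type=pytesseract.Output.DICT)
#   plate, confidence = best_candidate(data, min_conf=60)
```

The same function can also be used to compare readings from several preprocessed versions of the image by merging their 'text' and 'conf' lists before calling it.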

Answered By - Dmitrii Z.

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
