Issue
I am using pytesseract with the line:
text = image_to_string(temp_test_file,
                       lang='eng',
                       boxes=False,
                       config='-c preserve_interword_spaces=1 hocr')
and am getting the error:
pytesseract.py
135| f = open(output_file_name, 'rb')
No such file or directory:
/var/folders/j3/dn60cg6d42bc2jwng_qzzyym0000gp/T/tess_EDOHFP.txt
Looking at the pytesseract source code, it seems that it cannot find the temporary output file it uses to store the output of the tesseract command.
I have seen other answers here that were resolved by checking that tesseract is installed and callable from the command terminal. For me it is, so that is not the problem here. Any ideas what this could be and how to fix it? Thanks
Solution
It turns out that pytesseract could not find the temporary output file because it was being stored with an extension other than .txt or .box (it was a .hocr file). From the source code, these are the only tesseract output file types that pytesseract supports (or, more precisely, looks for). The relevant snippets from the source are below:
input_file_name = '%s.bmp' % tempnam()
output_file_name_base = tempnam()
if not boxes:
    output_file_name = '%s.txt' % output_file_name_base
else:
    output_file_name = '%s.box' % output_file_name_base
# ...
if status:
    errors = get_errors(error_string)
    raise TesseractError(status, errors)
f = open(output_file_name, 'rb')  # line 135 of pytesseract.py, where the error is raised
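To make the mismatch concrete, here is a minimal sketch comparing the file name pytesseract tries to open with the one tesseract writes. Both helper names and the base path are made up for illustration, and the second function encodes an assumption taken from the behavior described above (the hocr config producing a .hocr file), not anything read from tesseract's own source.

```python
def name_pytesseract_opens(base, boxes=False):
    # Mirrors the quoted snippet: only '.txt' or '.box' is ever tried.
    return '%s.box' % base if boxes else '%s.txt' % base


def name_tesseract_writes(base, config=''):
    # Assumption, based on the behavior described in this answer: the 'hocr'
    # config makes tesseract write '<base>.hocr' instead of '<base>.txt'.
    if 'hocr' in config.split():
        return '%s.hocr' % base
    return '%s.txt' % base


base = '/tmp/tess_EXAMPLE'  # made-up base name, not a real temp file
config = '-c preserve_interword_spaces=1 hocr'
opened = name_pytesseract_opens(base)
written = name_tesseract_writes(base, config)
print(opened)
print(written)
print(opened == written)
```

With the config from the question the two names differ, which is why open() fails even though tesseract itself ran without error.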
Looking at pytesseract's GitHub pull requests, it seems that support for other output types is planned but not yet implemented (the source I used to show why the .hocr file appeared not to be found was copied from the pytesseract master branch).
Until then, I made some hackish changes to the pytesseract script to support multiple file types.
This version does not set an extension on the output file name (since tesseract adds one automatically). Instead, it looks through the directory where pytesseract stores its temp output files for the file whose name starts with the output file name assigned by pytesseract (up to the first '.' character), without caring about the extension:
import os
import tempfile

def tempnam():
    ''' returns a temporary file-name and directory '''
    tmpfile = tempfile.NamedTemporaryFile(prefix="tess_")
    return tmpfile.name, tempfile.gettempdir()

def image_to_string(image, lang=None, boxes=False, config=None, nice=0):
    if len(image.split()) == 4:
        # In case we have 4 channels, lets discard the Alpha.
        # Kind of a hack, should fix in the future some time.
        r, g, b, a = image.split()
        image = Image.merge("RGB", (r, g, b))

    (input_file_name, _) = tempnam()
    input_file_name += '.bmp'
    (output_file_name_base, output_filename_base_dir) = tempnam()
    if not boxes:
        # Don't put an extension on the output file name because Tesseract will do it automatically
        output_file_name = '%s' % output_file_name_base
    else:
        output_file_name = '%s.box' % output_file_name_base

    try:
        ########## DEBUGGING
        #print('input file name: %s' % input_file_name)
        #print('temp output name: %s' % output_file_name)
        #print('temp output dir: %s' % output_filename_base_dir)
        ##########
        image.save(input_file_name)
        status, error_string = run_tesseract(input_file_name,
                                             output_file_name_base,
                                             lang=lang,
                                             boxes=boxes,
                                             config=config,
                                             nice=nice)
        if status:
            errors = get_errors(error_string)
            raise TesseractError(status, errors)

        # Find the temp output file in the temp dir under whatever extension
        # tesseract has assigned. Match on the base name plus '.', so that
        # .txt, .box and .hocr outputs are all found.
        output_file_name_leaf = os.path.basename(output_file_name_base) + '.'
        print('**output file starts with %s' % output_file_name_leaf)
        for f in os.listdir(output_filename_base_dir):
            if f.startswith(output_file_name_leaf):
                output_file_name_leaf = f
                break

        output_file_name_abs = os.path.join(output_filename_base_dir, output_file_name_leaf)
        f = open(output_file_name_abs, 'rb')
        try:
            return f.read().decode('utf-8').strip()
        finally:
            f.close()
    finally:
        cleanup(input_file_name)
        # if the temp output file was located, delete it under its real name
        if 'output_file_name_abs' in locals():
            output_file_name = output_file_name_abs
            print('**deleting temp output file %s' % output_file_name)
        cleanup(output_file_name)
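The prefix lookup at the core of the change above can also be pulled out into a small standalone helper, which makes it easier to try on its own. This is only a sketch: find_output_file is a hypothetical name (it is not part of pytesseract), and the demo uses a throwaway directory with a fake .hocr file in place of real tesseract output.

```python
import os
import tempfile


def find_output_file(directory, base_name):
    """Return the path of the first file named '<base_name>.<ext>' in directory, or None."""
    prefix = base_name + '.'
    for entry in sorted(os.listdir(directory)):
        if entry.startswith(prefix):
            return os.path.join(directory, entry)
    return None


# Throwaway directory standing in for the temp dir used by pytesseract.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'tess_demo.hocr'), 'w') as handle:
    handle.write('<html></html>')

found = find_output_file(demo_dir, 'tess_demo')
print(os.path.basename(found))
```

Returning None when nothing matches lets the caller raise a clear error instead of failing later inside open().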
Hope this helps others.
Answered By - lampShadesDrifter