Issue
So I am putting together a simple Python script to OCR a PDF:
from PIL import Image
from tika import parser
import argparse
import img2pdf
import ocrmypdf


def main():
    parser = argparse.ArgumentParser(description="Get text from image.")
    parser.add_argument("image_path", metavar="i", help="The path to the image being used.")
    args = parser.parse_args()
    image_path = args.image_path
    pdf_from_image_file_name = convert_to_pdf(image_path)
    pdf_w_ocr_file_name = ocr_pdf()
    raw_text_from_ocr_pdf = get_text_from_pdf()
    print(raw_text_from_ocr_pdf)


def convert_to_pdf(image_path, new_pdf_file_name="pdf_from_image"):
    temp_image = Image.open(image_path)
    pdf_bytes = img2pdf.convert(temp_image.filename)
    new_file = open('./' + new_pdf_file_name + '.pdf', 'wb')
    new_file.write(pdf_bytes)
    temp_image.close()
    new_file.close()
    return new_pdf_file_name


def ocr_pdf(pdf_file_path="./temp_pdf_file_name.pdf", new_pdf_file_name="pdf_w_ocr.pdf"):
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
    return new_pdf_file_name


def get_text_from_pdf(pdf_file_path="./pdf_w_ocr.pdf"):
    raw_pdf = parser.from_file(pdf_file_path)
    return raw_pdf['content']


if __name__ == '__main__':
    main()
When the script hits import ocrmypdf, it triggers a [WinError 2] The system cannot find the file specified error but continues past it. The conversion from JPG or PNG to PDF works and outputs just fine. However, when it reaches ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True), I get ValueError: invalid version number '4.0.0.20181030'.
The full stack trace is:
[WinError 2] The system cannot find the file specified
Traceback (most recent call last):
  File "workshop_v1.py", line 71, in <module>
    main()
  File "workshop_v1.py", line 49, in main
    pdf_w_ocr_file_name = ocr_pdf()
  File "workshop_v1.py", line 63, in ocr_pdf
    ocrmypdf.ocr(pdf_file_path, './'+new_pdf_file_name, deskew=True)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\api.py", line 339, in ocr
    check_options(options, plugin_manager)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 271, in check_options
    _check_options(options, plugin_manager, ocr_engine_languages)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_validation.py", line 266, in _check_options
    plugin_manager.hook.check_options(options=options)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\manager.py", line 87, in <lambda>
    firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 208, in _multicall
    return outcome.get_result()
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\pluggy\callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\builtin_plugins\tesseract_ocr.py", line 84, in check_options
    version_parser=tesseract.TesseractVersion,
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\subprocess\__init__.py", line 313, in check_external_program
    if found_version and version_parser(found_version) < version_parser(need_version):
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 40, in __init__
    self.parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\site-packages\ocrmypdf\_exec\tesseract.py", line 72, in parse
    super().parse(vstring)
  File "C:\Users\xxx\anaconda3\envs\python37\lib\distutils\version.py", line 137, in parse
    raise ValueError("invalid version number '%s'" % vstring)
ValueError: invalid version number '4.0.0.20181030'
I'm running this on an x64 PC with Windows 10. Specifically, I'm running a Python 3.7.10 environment via Anaconda. Package version info in Python includes (via pip freeze):
- pytesseract v0.3.7
- ocrmypdf 12.1.0
- ghostscript v0.7
Other potentially important version information outside Python includes:
- tesseract-ocr v4.0.0.20181030 (I've added and tried a number of environment variables with this, detailed below)
- leptonica v1.76.0
- ghostscript v9.54.0
- qpdf 10.3.2 (this was downloaded and then the files were placed in the C:/Windows/System32 directory)
Tesseract is installed here: C:\Program Files (x86)\Tesseract-OCR\, so I've tried the following environment variables (as user variables):
- OCRMYPDF_TESSERACT = C:\Program Files (x86)\Tesseract-OCR\tesseract.exe
- Added C:\Program Files (x86)\Tesseract-OCR to the end of Path
- TESSDATA_PREFIX = C:\Program Files (x86)\Tesseract-OCR\tessdata
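On Windows, changes to user variables only reach processes started after the change, so it can be worth confirming what the Python process actually sees. The following is a minimal sketch of such a check; the helper check_env_paths is hypothetical, and whether OCRmyPDF itself reads OCRMYPDF_TESSERACT is not confirmed here.

```python
import os
from pathlib import Path

# Variable names taken from the list above; the helper itself is hypothetical.
VARIABLES = ("OCRMYPDF_TESSERACT", "TESSDATA_PREFIX")


def check_env_paths(names, environ=None):
    """Return {name: (value, exists)} for each environment variable.

    value is None when the variable is not set in this process;
    exists is True only when the value names an existing file or directory.
    """
    environ = os.environ if environ is None else environ
    report = {}
    for name in names:
        value = environ.get(name)
        exists = value is not None and Path(value).exists()
        report[name] = (value, exists)
    return report


if __name__ == "__main__":
    for name, (value, exists) in check_env_paths(VARIABLES).items():
        print(f"{name}: {value!r} (path exists: {exists})")
```

If a variable shows up as None here, the terminal or IDE running the script was probably started before the variable was set and needs to be reopened.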
Any pointers or ideas would be much appreciated!
Solution
The repository was updated in response to the issue I opened here: https://github.com/jbarlow83/OCRmyPDF/issues/795
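For context, the traceback ends in distutils' version.py, which rejects the string '4.0.0.20181030' reported by this Tesseract build. The sketch below illustrates why a strict parser refuses a four-component version. The regular expression is modeled on distutils' StrictVersion and reproduced here as an assumption, and trim_to_strict is a hypothetical helper, not necessarily how the upstream fix works.

```python
import re

# Pattern modeled on distutils' StrictVersion: two or three numeric
# components plus an optional a/b pre-release tag. Reproduced here as an
# assumption so the sketch does not import distutils (removed in Python 3.12).
STRICT_VERSION_RE = re.compile(
    r"^(\d+)\.(\d+)(\.(\d+))?([ab](\d+))?$", re.ASCII
)


def is_strict_version(vstring):
    """Return True when vstring has the strict major.minor[.patch] shape."""
    return STRICT_VERSION_RE.match(vstring) is not None


def trim_to_strict(vstring):
    """Hypothetical helper: keep at most the first three numeric components."""
    match = re.match(r"\d+\.\d+(\.\d+)?", vstring)
    if match is None:
        raise ValueError(f"no leading numeric version in {vstring!r}")
    return match.group(0)


if __name__ == "__main__":
    reported = "4.0.0.20181030"
    print(is_strict_version(reported))                  # fourth component is rejected
    print(trim_to_strict(reported))                     # leading major.minor.patch only
    print(is_strict_version(trim_to_strict(reported)))  # trimmed form is accepted
```

The extra date suffix (20181030) is the part the strict pattern cannot match, which is consistent with the ValueError in the question.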
To install the updated version directly from the repository, use:
pip3 install git+https://github.com/jbarlow83/OCRmyPDF.git#egg=ocrmypdf
I still get [WinError 2] The system cannot find the file specified, but it works so I'm not going to question it at this point.
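WinError 2 is the Windows "file not found" error, and Python raises it when it tries to launch an executable that cannot be located. One possible, unconfirmed source of the message is an external helper program that is not on Path. The sketch below lists which programs resolve; the executable names are assumptions about a typical Windows setup, and find_missing is a hypothetical helper.

```python
import shutil

# Executable names are assumptions about a typical Windows setup
# (Tesseract, the 64-bit Ghostscript console binary, qpdf, and two
# optional helpers); adjust them to match the local installation.
PROGRAMS = ("tesseract", "gswin64c", "qpdf", "unpaper", "pngquant")


def find_missing(programs, which=shutil.which):
    """Return the names in programs that `which` cannot resolve on PATH."""
    return [name for name in programs if which(name) is None]


if __name__ == "__main__":
    for name in PROGRAMS:
        print(f"{name}: {shutil.which(name) or 'NOT FOUND'}")
    print("missing:", find_missing(PROGRAMS))
```

Any name reported as missing is a candidate for the leftover message, although this check alone does not prove which lookup triggers it.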
Answered By - user3684314