Saturday, November 18, 2023

[FIXED] Tesseract OCR gives a strange output in Cloud Run instance, while local output is correct

Labels: docker, google-cloud-run, ocr, python-tesseract, tesseract

Issue

We have a pipeline running in Google Cloud Platform that:

extracts crops from a text document image
processes those crops to ensure they are always black text on white background
passes the crops to pytesseract to extract the text.
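The second step (forcing black text on a white background) can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function name `ensure_dark_on_light`, the flat 8-bit grayscale buffer, and the mean-intensity threshold of 128 are all assumptions made for the example.

```python
def ensure_dark_on_light(pixels: bytes, threshold: int = 128) -> bytes:
    """Return an 8-bit grayscale buffer with dark text on a light background.

    Assumption: the background covers most of the crop, so a low mean
    intensity means the crop is light-on-dark and needs inverting.
    """
    if not pixels:
        return pixels
    mean = sum(pixels) / len(pixels)
    if mean < threshold:
        # Mostly dark crop: flip every pixel so the background becomes light.
        return bytes(255 - p for p in pixels)
    return pixels
```

With Pillow, a buffer like this could be wrapped with `Image.frombytes("L", size, data)` and passed to `pytesseract.image_to_string`; which PSM value to pass in `config` depends on the layout of the crop and is not stated in the question.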

Most of the time everything works well and the extracted text is correct, but some crops are read incorrectly.

One example is a multiline crop in the following format, which is often output incorrectly, e.g.:

35LURC194-     -> output as SSLUBe404-
6                           6

(This is a slightly modified instance of the issue, but you get the gist.)

Now, here is where things become weird.

As part of our debugging process, we ran the same code locally, and, for every instance where the OCR text is faulty on production (Cloud), it works accurately on the local machine!

The differences between local and Cloud environment are:

	Local	Cloud
Operating System	Arch Linux	Debian Slim Buster Docker image
Python version	3.10.10	3.8.6
RAM	8 GB	3 GB
Environment	Native	Docker Container (Cloud Run)

Things we've tried so far:

Ensured the versions of the important packages (pytesseract, torch, torchvision, Tesseract) are the same on local and production
Added more RAM and CPU to the Cloud Run instance
Upgraded the Python version in the container Dockerfile to 3.10.10
Ensured the cropped image that's being passed to Tesseract is the same in both scenarios (same aspect ratio, looks the same)
Triple-checked that the code running locally is the same as the code running on Cloud
Ran Tesseract with different OEM settings and the correct PSM (multiline) in both scenarios
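Checks like "same versions" and "looks the same" are easier to trust when both environments print a comparable report. The sketch below is one possible way to do that, using only the standard library; `tesseract_version_output` and `environment_report` are hypothetical helper names, and it assumes the `tesseract` binary is on `PATH`.

```python
import hashlib
import platform
import shutil
import subprocess
import sys


def tesseract_version_output(binary: str = "tesseract") -> str:
    """Return the raw `tesseract --version` text, or a note if it is missing."""
    if shutil.which(binary) is None:
        return "tesseract not found on PATH"
    result = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr, so keep both.
    return (result.stdout + result.stderr).strip()


def environment_report(crop_bytes: bytes) -> dict:
    """Collect values worth diffing between the local machine and the container."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tesseract": tesseract_version_output(),
        # A byte-level hash is stricter than "looks the same".
        "crop_sha256": hashlib.sha256(crop_bytes).hexdigest(),
    }
```

For the hash to be meaningful, `crop_bytes` should be the raw pixel buffer handed to Tesseract (for example Pillow's `Image.tobytes()`), not a re-encoded file. Note that this report does not cover the language data files, which turned out to matter here.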

We're running out of ideas on what could be causing this; it's baffling, really. Everything up to the Tesseract processing step is the same in both scenarios, so the issue must lie with Tesseract itself or with the environment, yet everything we have checked is the same except the operating system.

Would love to hear any ideas on what else we could try, or whether someone else had a similar experience.

Solution

In the end it was indeed a version issue: the versions of the language data files differed.

This answer solved it for me. I downloaded the language data files with wget and, in the Dockerfile, copied them to /usr/share/tesseract-ocr/4.00/tessdata (the directory can vary depending on your OS), and it worked like a charm.

The strange thing is that installing the packages that provide these language files should normally be enough (e.g. apt install tesseract-ocr-eng on Debian), but in my case the packaged versions were not giving me the right output.
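One way to confirm that two environments really use different language data is to fingerprint the .traineddata files on each side and compare the results. The following is a minimal sketch under that assumption; `fingerprint_tessdata` is a hypothetical helper, not part of Tesseract or pytesseract.

```python
import hashlib
from pathlib import Path


def fingerprint_tessdata(tessdata_dir: str) -> dict:
    """Map each *.traineddata file name to (size in bytes, SHA-256 hex digest)."""
    fingerprints = {}
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        data = path.read_bytes()
        fingerprints[path.name] = (len(data), hashlib.sha256(data).hexdigest())
    return fingerprints
```

Calling `fingerprint_tessdata("/usr/share/tesseract-ocr/4.00/tessdata")` inside the container and the equivalent directory on the local machine, then diffing the two dictionaries, would show whether a file such as eng.traineddata differs between the two.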

Note: An important step that helped me find the solution was running the container locally (normally it runs in Cloud Run on GCP). This allowed for much quicker debugging and experimentation.

Answered By - ephores

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
