Issue
We have a pipeline running in Google Cloud Platform that:
- extracts crops from a text document image
- processes those crops to ensure they are always black text on white background
- passes the crops to pytesseract to extract the text.
Most of the time the extracted text is correct, but some crops fail.
One example is a two-line crop that is often output incorrectly, e.g.:
Expected:
35LURC194-
6
Output:
SSLUBe404-
6
(this is a slightly modified instance of the issue, but you get the gist)
Now, here is where things become weird.
As part of our debugging process, we ran the same code locally, and, for every instance where the OCR text is faulty on production (Cloud), it works accurately on the local machine!
The differences between local and Cloud environment are:
| | Local | Cloud |
|---|---|---|
| Operating System | Arch Linux | Debian Slim Buster Docker image |
| Python version | 3.10.10 | 3.8.6 |
| RAM | 8 GB | 3 GB |
| Environment | Native | Docker Container (Cloud Run) |
Things we've tried so far:
- Ensured the versions of the important packages (pytesseract, torch, torchvision, Tesseract) are the same on local and production
- Added more RAM and CPU to the Cloud Run instance
- Upgraded the Python version in the container Dockerfile to 3.10.10
- Ensured the cropped image that's being passed to Tesseract is the same in both scenarios (same aspect ratio, looks the same)
- Triple-checked that the code running locally is the same as the code running on Cloud
- Ran Tesseract with different OEM settings and the correct PSM (multiline) in both scenarios
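On the "looks the same" check: a stricter way to confirm that both environments feed Tesseract identical input is to compare a digest of the raw pixel buffer. The following is only a sketch, assuming the crop is available as a NumPy-style array; `crop_fingerprint` is a hypothetical helper, not something from the original pipeline.

```python
import hashlib


def crop_fingerprint(raw: bytes, shape: tuple, dtype: str) -> str:
    """Return a SHA-256 hex digest covering pixel bytes, shape, and dtype."""
    digest = hashlib.sha256()
    # Include shape and dtype so that two buffers with the same bytes but a
    # different layout (e.g. 2x3 vs 3x2) do not produce the same digest.
    digest.update(repr((tuple(shape), dtype)).encode("utf-8"))
    digest.update(raw)
    return digest.hexdigest()
```

With a NumPy array `arr`, the call would look like `crop_fingerprint(arr.tobytes(), arr.shape, str(arr.dtype))`. Logging this value right before the OCR call in both environments shows whether the inputs really match bit for bit. If the digests match and the text still differs, the preprocessing is ruled out.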
We're running out of ideas about what could be causing this. Everything up to the Tesseract processing step is the same in both scenarios, so the issue must lie with Tesseract itself or the environment, yet everything appears to be the same except the operating system.
Would love to hear any ideas on what else we could try, or whether someone else had a similar experience.
Solution
In the end it was indeed a version issue, specifically the versions of the language data files.
This answer solved it for me: I downloaded the language data files with wget, copied them in the Dockerfile to /usr/share/tesseract-ocr/4.00/tessdata (the directory can vary depending on your OS), and it worked like a charm.
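To make it explicit which language data is used, pytesseract can be pointed at a specific tessdata directory through its config string. This is a rough sketch rather than the code from the pipeline: `build_tesseract_config` and `ocr_crop` are hypothetical names, and `--psm 6` (a single uniform block of text) is only an assumed default, since the post does not say which PSM was used.

```python
import shlex


def build_tesseract_config(tessdata_dir: str, psm: int = 6) -> str:
    """Build a config string that pins the tessdata directory and the PSM."""
    # shlex.quote keeps paths containing spaces intact; on Linux pytesseract
    # splits the config string with shell-like rules.
    return f"--tessdata-dir {shlex.quote(tessdata_dir)} --psm {psm}"


def ocr_crop(image, tessdata_dir: str, lang: str = "eng") -> str:
    """Run OCR on one crop using the language data found in tessdata_dir."""
    import pytesseract  # imported lazily so the helper above works without it

    return pytesseract.image_to_string(
        image, lang=lang, config=build_tesseract_config(tessdata_dir)
    )
```

Passing the directory explicitly means the same files are used regardless of what the OS package installed, which makes local and container runs easier to compare.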
The strange thing is that installing the packages that provide these language files should normally be enough (e.g. apt install tesseract-ocr-eng on Debian), but in my case the packaged versions did not give the right outputs.
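Since matching package versions did not guarantee matching language data, one way to spot this kind of mismatch is to fingerprint the .traineddata files in each environment and compare the results. A minimal sketch, assuming the files live in a single directory; `tessdata_fingerprints` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def tessdata_fingerprints(tessdata_dir: str) -> dict:
    """Map each *.traineddata file name to (size in bytes, SHA-256 hex digest)."""
    result = {}
    for path in sorted(Path(tessdata_dir).glob("*.traineddata")):
        data = path.read_bytes()
        result[path.name] = (len(data), hashlib.sha256(data).hexdigest())
    return result
```

Running `tessdata_fingerprints("/usr/share/tesseract-ocr/4.00/tessdata")` inside the container and the equivalent call on the local machine (the directory differs between distributions) gives two dictionaries to diff. A different size or digest for eng.traineddata would point at the language data rather than at Tesseract or Python.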
Note: An important step that helped me find the solution was running the container locally (normally it runs in Cloud Run on GCP); this allowed much quicker debugging and experimentation.
Answered By - ephores