Issue
I'm building a scraper that gets images from a Telegram channel and then uses Tesseract to OCR the text. During testing I manually downloaded the images from the channel using Telegram's web interface (Windows 8.1, Chrome, right-click, Save As, etc.) and ran Tesseract on them.
The results were perfect using a simple:
ocr_test = pytesseract.image_to_string(image).strip()
I have since incorporated the Telegram listener using Telethon which downloads the same images from the Telegram API.
The results for these images are much, much worse. I'm using the same PC, spec, environment, software versions, etc. There are 30 images in total and the issue occurs on all of them.
What causes this? Is there a way around it?
I can set about pre-processing the images but that would be annoying given the original results.
Solution
They are NOT the same images. The image saved from Chrome is 449 x 800, while the image downloaded through the API is 719 x 1280. The letters therefore end up at very different pixel sizes, which affects how well Tesseract recognizes them.
Additionally, JPEG is a lossy format that is poorly suited to OCR, and its compression produces different artifacts at different image sizes.
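One way to act on this is to scale the API image down to the dimensions of the Chrome download before running OCR. The following is only a minimal sketch, not a verified fix: the helper names scale_to_height and ocr_at_height are made up for this example, and it assumes Pillow and pytesseract are installed. Whether resizing restores the original accuracy would have to be checked against the actual images.

```python
def scale_to_height(size, target_height):
    """Return (width, height) scaled to target_height, keeping the aspect ratio."""
    width, height = size
    return (round(width * target_height / height), target_height)


def ocr_at_height(path, target_height=800):
    """Resize the image at `path` to target_height, then OCR it.

    Hypothetical helper: the default of 800 matches the Chrome download
    described above. Imports are local so scale_to_height can be used
    without Pillow or pytesseract installed.
    """
    from PIL import Image
    import pytesseract

    image = Image.open(path)
    new_size = scale_to_height(image.size, target_height)
    # LANCZOS is a high-quality resampling filter for downscaling.
    resized = image.resize(new_size, Image.LANCZOS)
    return pytesseract.image_to_string(resized).strip()
```

For the sizes mentioned in the answer, scale_to_height((719, 1280), 800) gives (449, 800), i.e. the same dimensions as the image saved from Chrome.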
Answered By - user898678