Friday, June 24, 2022

[FIXED] How to verify if the image contains noise in background before ‘OCR’ing

June 24, 2022 background-process, image-processing, noise-reduction, opencv, python-tesseract No comments

Issue

I have several types of images that I need to extract text from. I can manually classify the images into 3 categories based on the noise on the background:

Images with no noise.
Images with some light noise in the background.
Heavy noise in the background.

For the category 1 images, I could apply OCR’ing fine without problems. → basic case.

For the category 2 images and some of the category 3 images, I could manage to extract the texts by applying the following methods:

Grayscale, Gaussian blur, Otsu’s threshold
Morph open to remove noise and invert the image → then perform text extraction.

For the OCR’ing task, one removing noise method is obviously not working for all images. So, Is there any method for classifying the level background noise of the images?

Please all suggestions are welcome. Thanks in advance.

Solution

Following up on your comment from other question here are some things you could try. Some combinations of ideas below should help.

Image Embedding and Vector Clustering

Manual

Use a pretrained network such as resnet on imagenet (may not work good) or a simple pretrained network trained on MNIST/EMNIST.

Extract and concat some layers flattened weight vectors toward end of network. Apply dimensionality reduction and apply nearest neighbor/approximate nearest neighbor algorithms to find closest matches. Set number of clusters 3 as you have 3 types of images.

For nearest neighbor start with KNN. There are also many libraries in github that may help such as faiss, annoy etc.

More can be found from,

https://github.com/topics/nearest-neighbor-search

https://github.com/topics/approximate-nearest-neighbor-search

If result of above is not good enough try finetuning only last few layers MNIST/EMNIST trained network.

Using Existing Libraries

For grouping/finding similar images look into,

https://github.com/jina-ai/jina

You should be able to find more similarity clustering using tags neural-search, image-search on github.

https://github.com/topics/neural-search

https://github.com/topics/image-search

OCR

Try easyocr as it worked better for me than tesserect last time used ocr.
Run it first on whole document to see if requirements met.
Use not so tight cropping instead some/large padding around text if possible with no other text nearby. Another way is try padding in all direction in tight cropped text to see if it improves ocr result.
For tesserect see if tools mentioned in improving quality doc helps.

Classification

If you already have data sorted into 3 different directory and want to classify future images only then I suggest a neural network. Modify mnist or cifar example of pytorch or tensorflow to train and classify test images.
Based on sample images it looks like computer font instead of handwritten text. If that is the case Template matching at multiple scales may help. You have to see if the noise affects the matching result. Image from, https://www.pyimagesearch.com/2021/03/22/opencv-template-matching-cv2-matchtemplate/

Noise Removal

Here also you can go with a neural network. Train a denoising autoencoder with Category 1 images, corrupted type 1 images by adding noise that mimicks Category 2 and Category 3 images. This way the neural network will classify the 3 image categories without needing manually create dataset and in post processing you can use another neural network or image processing method to remove noise based on category type. Image from, https://keras.io/examples/vision/autoencoder/
Try existing libraries or pretrained networks on github to remove noise in the whole document/cropped region. Look into rembg if it works on text documents.

Answered By - B200011011

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, June 24, 2022

[FIXED] How to verify if the image contains noise in background before ‘OCR’ing

Issue

Solution

Image Embedding and Vector Clustering

Manual

Using Existing Libraries

OCR

Classification

Noise Removal

0 comments:

Post a Comment

Popular Posts

Labels