Wednesday, February 7, 2024

[FIXED] How to get dpi of an image cropped with Python? Tesseract --dpi parameter

Labels: dpi, opencv, pymupdf, python-tesseract, tesseract

Issue

My code opens a PDF, converts the first page to an image, cuts rectangles out of this image by coordinates, and extracts text from each cropped rectangle using Tesseract.

I noticed that OCR performs much worse on some images, particularly the larger ones, than on others.

After experimenting with Tesseract on the command line, I also found that for some images Tesseract estimates the resolution itself, and this estimate affects the result.

I also experimented with the --dpi parameter. For some images the best results came from --dpi 1800, for others from --dpi 300. I'm looking for a way to set the dpi of my images before extracting text, or a way to find out what their dpi is.

I also tried pix.set_dpi() and get_pixmap(dpi = ..), but neither improved anything. I would be thankful for any suggestions.

Here is the code I use:

        import cv2 as cv
        import fitz  # PyMuPDF
        import numpy as np
        import pytesseract

        page = doc.load_page(0)
        page_size = page.rect
        zoom = 3
        mat = fitz.Matrix(zoom, zoom)
        pix = page.get_pixmap(matrix=mat)
        # view the raw RGB samples as a NumPy array, then convert to OpenCV's BGR order
        img_data = pix.samples
        img_array = np.frombuffer(img_data, dtype=np.uint8)
        img_array = img_array.reshape(pix.height, pix.width, pix.n)
        img = cv.cvtColor(img_array, cv.COLOR_RGB2BGR)

        #...
        k = 0
        result_dict = {}
        for i, rect in enumerate(rectangles):
            x1, y1, x2, y2 = rect  # pixel coordinates in the rendered image
            roi = img[y1:y2, x1:x2]
            k += 1
            text = pytesseract.image_to_string(roi, lang="eng+deu")

Solution

You can OCR just a region of a PDF page with PyMuPDF, rendering only that region at a resolution you choose:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
page = doc[pno]  # pno: 0-based page number
rect = fitz.Rect(x0, y0, x1, y1)  # an area on the page, in page coordinates
pix = page.get_pixmap(clip=rect, dpi=150)  # render only that area, at 150 dpi

# make a 1-page temp PDF from the area and OCR it
ocr = fitz.open("pdf", pix.pdfocr_tobytes())  # 1-page temp PDF
ocrpage = ocr[0]
text = ocrpage.get_text()  # OCRed text
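
The resolution asked about in the Issue section can also be worked out by arithmetic. PDF page coordinates are in points (1/72 inch), so rendering with fitz.Matrix(zoom, zoom) yields 72 * zoom dpi, and zoom = 3 gives 216 dpi. The sketch below is not part of the original answer. The helper names (zoom_to_dpi, tesseract_dpi_config, pixel_rect_to_page_rect) are hypothetical, and the rectangle conversion assumes an unrotated page rendered in full, as in the question's code.

```python
POINTS_PER_INCH = 72  # PDF page coordinates: 1 unit = 1/72 inch


def zoom_to_dpi(zoom):
    """Resolution of a pixmap rendered with fitz.Matrix(zoom, zoom)."""
    return POINTS_PER_INCH * zoom


def tesseract_dpi_config(zoom):
    """Extra-arguments string that passes the render resolution to Tesseract."""
    return f"--dpi {round(zoom_to_dpi(zoom))}"


def pixel_rect_to_page_rect(rect, zoom):
    """Scale (x1, y1, x2, y2) from rendered-image pixels back to page points.

    Assumption: the image covers the whole, unrotated page.
    """
    return tuple(value / zoom for value in rect)
```

With these helpers, the question's call could become pytesseract.image_to_string(roi, lang="eng+deu", config=tesseract_dpi_config(zoom)), and a pixel rectangle could be fed to the snippet above as fitz.Rect(*pixel_rect_to_page_rect(rect, zoom)). Whether an explicit --dpi value improves recognition depends on the image, so treat it as something to test rather than a guaranteed fix.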

Answered By - Jorj McKie

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
