Issue
I'm trying to read the text in a specific region of a PDF page. How can I do it?
I've tried:
- Cropping the PDF page with PyPDF2 and reading only the cropped page. This doesn't work because PyPDF2's cropbox only shrinks the visible "view" and keeps every item outside the specified cropbox. Reading the cropped PDF with extract_text() therefore returns all of the "invisible" content, not only the cropped part.
- Converting the PDF page to PNG, cropping it, and reading the PNG with Pytesseract. Pytesseract doesn't work properly, and I don't know why.
Solution
PyMuPDF can probably do this.
I just answered another question regarding getting the "highlighted text" from a page, but the solution uses the same relevant parts of the PyMuPDF API you want:
- figure out a rectangle that defines the area of interest
- extract text based on that rectangle
I say "probably" because I haven't actually tried it on your PDF, so I cannot say for certain that its text is amenable to this process.
import os.path

import fitz
from fitz import Document, Page, Rect

# For visualizing the rects that PyMuPDF uses compared to what you see in the PDF
VISUALIZE = True

input_path = "test.pdf"
doc: Document = fitz.open(input_path)

for i in range(len(doc)):
    page: Page = doc[i]
    page.clean_contents()  # https://pymupdf.readthedocs.io/en/latest/faq.html#misplaced-item-insertions-on-pdf-pages

    # Hard-code the rect you need
    rect = Rect(0, 0, 100, 100)

    if VISUALIZE:
        # Draw a red box to visualize the rect's area (text)
        page.draw_rect(rect, width=1.5, color=(1, 0, 0))

    text = page.get_textbox(rect)
    print(text)

if VISUALIZE:
    head, tail = os.path.split(input_path)
    viz_name = os.path.join(head, "viz_" + tail)
    doc.save(viz_name)
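If it helps to see what "extract text based on that rectangle" amounts to, here is a minimal pure-Python sketch of the idea, with no PDF needed. It assumes word tuples shaped like the first five fields of what page.get_text("words") returns, (x0, y0, x1, y1, text), with coordinates in points measured from the page's top-left corner. The helper words_in_rect and the sample data are made up for illustration. This is not PyMuPDF's actual implementation, and its rule for words that straddle the rectangle's border may differ.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1, text): assumed to match the first five fields of a
# PyMuPDF "words" tuple. Coordinates are in points, origin at top-left.
Word = Tuple[float, float, float, float, str]
Box = Tuple[float, float, float, float]


def words_in_rect(words: Iterable[Word], rect: Box) -> List[str]:
    """Return the text of every word whose box lies fully inside rect.

    Hypothetical helper for illustration only, not a PyMuPDF function.
    """
    rx0, ry0, rx1, ry1 = rect
    return [
        w[4]
        for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]


# Made-up sample data standing in for page.get_text("words").
sample_words = [
    (10.0, 10.0, 40.0, 22.0, "Invoice"),
    (45.0, 10.0, 80.0, 22.0, "No."),
    (300.0, 10.0, 360.0, 22.0, "Page"),
    (10.0, 500.0, 60.0, 512.0, "Footer"),
]

region = (0.0, 0.0, 100.0, 100.0)  # same area as Rect(0, 0, 100, 100)
print(" ".join(words_in_rect(sample_words, region)))  # prints: Invoice No.
```

Unlike the cropbox approach from the question, the words outside the rectangle are dropped by their coordinates rather than just hidden from view.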
For context, here's the project I just finished, where this approach worked for highlighted text: https://github.com/zacharysyoung/extract_highlighted_text.
Answered By - Zach Young