Issue
I'm a mechanical engineer and new to programming.
I want to identify the differently colored rectangular ducts and the information enclosed in them.
Any help would be greatly appreciated.
I've tried extracting the text using Tesseract.
Solution
Here is a first simple script that extracts vector graphics with a non-black border, along with any text inside them.
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
page = doc[0]  # first page (0-based numbering scheme)
paths = page.get_drawings()  # extract all vector graphics (list of dictionaries)
for path in paths:
    if path["color"] is None or path["color"] == (0, 0, 0):
        # ignore borderless graphics and black borders
        continue
    print(f"border color {path['color']}")
    text = page.get_text(clip=path["rect"])  # extract any text inside
    print(f"text inside {path['rect']}: {text}")
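One possible follow-up is to group the extracted rectangles by border color, so that each duct color can be handled separately. The sketch below is only an illustration: the helper name group_by_border_color, the ndigits rounding parameter, and the sample data are assumptions, and it presumes each drawing is a dictionary with "color" and "rect" keys, as in the script above.

```python
from collections import defaultdict

BLACK = (0.0, 0.0, 0.0)


def group_by_border_color(paths, ndigits=2):
    """Group rectangles by rounded border color (hypothetical helper).

    `paths` is a list of dictionaries with "color" and "rect" keys,
    the shape assumed here for the output of page.get_drawings().
    """
    groups = defaultdict(list)
    for path in paths:
        color = path.get("color")
        if color is None:
            continue  # borderless graphic
        # Round so that nearly identical colors land in the same group.
        key = tuple(round(c, ndigits) for c in color)
        if key == BLACK:
            continue  # black border
        groups[key].append(path["rect"])
    return dict(groups)


# Stand-in data so the sketch runs without a PDF; the values are made up.
sample_paths = [
    {"color": (1.0, 0.0, 0.0), "rect": (0, 0, 10, 10)},
    {"color": None, "rect": (5, 5, 6, 6)},
    {"color": (0.0, 0.0, 0.0), "rect": (1, 1, 2, 2)},
    {"color": (0.999, 0.0, 0.0), "rect": (20, 20, 30, 30)},
    {"color": (0.0, 0.0, 1.0), "rect": (40, 40, 50, 50)},
]

for color, rects in group_by_border_color(sample_paths).items():
    print(f"border color {color}: {len(rects)} rectangle(s)")
```

With a real document, the same helper could presumably be called as group_by_border_color(page.get_drawings()), and each rectangle then passed to page.get_text(clip=...) as in the script above.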
Note: I am a maintainer and the original creator of PyMuPDF.
Answered By - Jorj McKie