Issue
Is it possible to write to a PDF file retroactively using pytesseract.image_to_data() output?
For my OCR pipeline, I needed granular access to my PDF's OCR data, so I requested it with this call:
ocr_dataframe = pytesseract.image_to_data(
    tesseract_image,
    output_type=pytesseract.Output.DATAFRAME,
    config=PYTESSERACT_CUSTOM_CONFIG
)
Now, I want to extract some tabular data from the PDF using pdfplumber. However, pdfplumber only accepts one of three inputs:
- path to your PDF file
- file object, loaded as bytes
- file-like object, loaded as bytes
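For the bytes case, a minimal sketch of how in-memory PDF bytes could be handed to pdfplumber without writing a file. The helper names `to_file_like` and `extract_tables_from_bytes` are made up for illustration; `pdfplumber.open`, `pdf.pages` and `page.extract_tables()` are the library calls assumed here, and pdfplumber is imported lazily so the wrapping step can be tried on its own.

```python
import io


def to_file_like(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in a seekable, file-like object (hypothetical helper)."""
    return io.BytesIO(pdf_bytes)


def extract_tables_from_bytes(pdf_bytes: bytes) -> list:
    """Return the tables found on each page of an in-memory PDF (hypothetical helper)."""
    import pdfplumber  # third-party; imported here so to_file_like works without it

    with pdfplumber.open(to_file_like(pdf_bytes)) as pdf:
        # One list of tables per page, in page order.
        return [page.extract_tables() for page in pdf.pages]
```

The same wrapper should work for the bytes returned by pytesseract.image_to_pdf_or_hocr(..., extension='pdf'), since that call returns the PDF as bytes.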
I am aware that I can use pytesseract to convert my original pdf to a searchable one (in bytes representation) using the following method:
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
However, I would like to avoid OCR'ing my PDFs twice. Is it possible to combine the output from pytesseract.image_to_data() with the original image and create some kind of bytes representation?
Any help would be much appreciated!
Solution
Okay, so I am pretty sure that what I was trying to do is not possible with the dataframe alone.
By nature, pytesseract.Output.DATAFRAME produces a pandas DataFrame. Nowhere in that data structure is the original image: the output is just rows and columns of text data, with no pixels at all.
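To make "rows and columns of text data" concrete, here is a rough sketch that rebuilds text lines from such rows. It assumes the usual Tesseract TSV columns (page_num, block_num, par_num, line_num, conf, text) and that the rows were obtained with something like ocr_dataframe.to_dict("records"). The function name `words_to_lines` and the sample rows are made up for illustration, not real OCR output.

```python
import math
from collections import defaultdict


def words_to_lines(rows):
    """Group word-level OCR rows into text lines (hypothetical helper)."""
    lines = defaultdict(list)
    for row in rows:
        text = row.get("text")
        # Structural rows (page, block, line) carry no text; pandas stores NaN there.
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        text = str(text).strip()
        # Tesseract reports conf = -1 for rows that are not recognized words.
        if not text or float(row.get("conf", -1)) < 0:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines[key].append(text)
    # Sorting the keys restores reading order: page, block, paragraph, line.
    return [" ".join(words) for _, words in sorted(lines.items())]


# Made-up rows that mimic the shape of image_to_data output.
sample_rows = [
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "conf": -1, "text": float("nan")},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "conf": 96.0, "text": "Invoice"},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 1, "conf": 91.0, "text": "Total"},
    {"page_num": 1, "block_num": 1, "par_num": 1, "line_num": 2, "conf": 88.0, "text": "42.00"},
]
print(words_to_lines(sample_rows))
```

Each row also carries left, top, width and height, so word positions are available, but the image itself is not.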
Instead, I created a class that could hold the original image and the ocr output dataframe at the same time. Here is what the instance initialization looks like:
def __init__(self, temp_image_path):
    self.image_path = pathlib.Path(temp_image_path)
    # cv2.imread expects a string path, so convert explicitly
    self.image = cv2.imread(str(self.image_path), cv2.IMREAD_GRAYSCALE)
    self.ocr_dataframe = self.ocr()

def ocr(self):
    # Preprocess the image in prep for pytesseract OCR
    # (ocr_preprocess is my own helper, defined elsewhere)
    tesseract_image = ocr_preprocess(self.image)

    # OCR the image using pytesseract
    # (PYTESSERACT_CUSTOM_CONFIG is my own config string, defined elsewhere)
    ocr_dataframe = pytesseract.image_to_data(
        tesseract_image,
        output_type=pytesseract.Output.DATAFRAME,
        config=PYTESSERACT_CUSTOM_CONFIG
    )
    return ocr_dataframe
Keeping the image in memory alongside the dataframe may be a little memory intensive, but I want to avoid having to write many images to disk.
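If memory becomes a concern, one possible variation (a sketch, not what the answer above does) is to keep only the image path and run OCR lazily on first access, caching the result so each image is still OCR'ed only once. The class name `OcrPage`, the `ocr_engine` parameter and `pytesseract_engine` are hypothetical names introduced for this example.

```python
import pathlib
from functools import cached_property
from typing import Any, Callable


class OcrPage:
    """Hypothetical wrapper that keeps an image path next to its OCR result."""

    def __init__(self, image_path: str, ocr_engine: Callable[[pathlib.Path], Any]):
        self.image_path = pathlib.Path(image_path)
        self._ocr_engine = ocr_engine

    @cached_property
    def ocr_result(self) -> Any:
        # Runs on first access only; later accesses reuse the stored value.
        return self._ocr_engine(self.image_path)


def pytesseract_engine(image_path: pathlib.Path):
    """One possible engine; imports are lazy so OcrPage works without the OCR stack."""
    import cv2
    import pytesseract

    image = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    return pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DATAFRAME
    )
```

Usage would look like page = OcrPage("scan.png", pytesseract_engine), followed by page.ocr_result wherever the dataframe is needed. Passing the engine in as a callable also makes it easy to substitute a stub in tests.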
Answered By - abrezey