Issue
I need to scan a large amount of text from a PDF for certain keywords, then list those keywords alongside the pages on which they are found. I'm admittedly very new to Python, and I'm starting out by following a tutorial that converts each page of a PDF to a JPEG, runs OCR on it, and writes the result to a text file. However, I'm running into problems even with this. Although I can turn some of the PDF into text, the output only contains one page, the last page. Why is that, and how do I fix it?
Thanks
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

PDF_file = "file2.pdf"
pages = convert_from_path(PDF_file, 500)

image_counter = 1
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

filelimit = image_counter-1
outfile = "out_text.txt"
f = open(outfile, "a")
for i in range(1, filelimit + 1):
    text = str(((pytesseract.image_to_string(Image.open(filename)))))
    text = text.replace('-\n', '')
    f.write(text)
f.close()
Solution
The problem is in how the filename variable is assigned.
When the first loop finishes:
for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1
Your filename variable is left set to the name built from the final image_counter value, i.e. the last page. The second loop then opens filename without ever updating it, so it reads that same last image on every one of its filelimit iterations.
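To see the effect in isolation, here is a minimal sketch of the same two loops with the PDF and OCR steps removed. The three placeholder strings in pages are made up for illustration; only the variable names come from the question.

```python
# Stand-in for the question's two loops, with no PDF or OCR involved.
pages = ["first", "second", "third"]  # hypothetical placeholder pages

image_counter = 1
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    image_counter = image_counter + 1

# The loop is over, but filename still holds the value from its last pass.
print(filename)  # page_3.jpg

filelimit = image_counter - 1
opened = []
for i in range(1, filelimit + 1):
    opened.append(filename)  # filename is never rebuilt from i

print(opened)  # ['page_3.jpg', 'page_3.jpg', 'page_3.jpg']
```

Python does not reset a variable when the loop that assigned it ends, so the second loop keeps seeing the last name.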
One solution is to rebuild filename from the loop index inside the second loop.
for i in range(1, filelimit + 1):
    filename = "page_"+str(i)+".jpg"
    text = str(((pytesseract.image_to_string(Image.open(filename)))))
    text = text.replace('-\n', '')
    f.write(text)
f.close()
That should make the second loop open and read each page image separately.
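For the larger goal in the question (listing keywords with the pages they appear on), it may help to keep the text of each page separate instead of appending everything to one file. The following is only a rough sketch under that assumption: the function names page_texts and keywords_by_page, the ocr parameter, and the sample strings are all hypothetical, and the OCR step is passed in as a callable so the example runs without a PDF or Tesseract installed.

```python
def page_texts(pages, ocr):
    """Return one OCR'd string per page, in page order.

    `ocr` is any callable that maps a page image to text (an assumption of
    this sketch, so the OCR step can be swapped out).
    """
    texts = []
    for page in pages:
        text = ocr(page)
        # Rejoin words hyphenated across a line break, as in the question.
        texts.append(text.replace("-\n", ""))
    return texts


def keywords_by_page(texts, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it."""
    found = {keyword: [] for keyword in keywords}
    for page_number, text in enumerate(texts, start=1):
        lowered = text.lower()
        for keyword in keywords:
            # Plain case-insensitive substring match.
            if keyword.lower() in lowered:
                found[keyword].append(page_number)
    return found


# Stand-in data: strings play the role of page images, and ocr returns them as-is.
fake_pages = ["Invoice total\ndue", "Ship-\nping address", "Invoice number"]
texts = page_texts(fake_pages, ocr=lambda page: page)
print(keywords_by_page(texts, ["invoice", "shipping"]))
# {'invoice': [1, 3], 'shipping': [2]}
```

With the question's libraries, pages would come from convert_from_path(PDF_file, 500) and ocr would be pytesseract.image_to_string, which accepts PIL image objects as well as file paths. Note that a substring match also fires inside longer words, so a stricter word-boundary check may be needed for short keywords.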
Answered By - Ahx