Thursday, January 20, 2022

[FIXED] Why is my code only creating a jpeg from the last page of the PDF and therefore only writing the last page to a text file?

pdf, python, python-tesseract

Issue

I need to scrape a large amount of text from a PDF for certain keywords and then list those keywords along with the pages they are found on. I'm admittedly very new to Python and am starting out by simply following a tutorial that converts a PDF to JPEG images and writes the recognized text to a file. However, I'm running into problems even with this. Although I am able to turn some of the PDF into text, the output only contains one page, the last one. Why is that, and how do I fix it?

Thanks

from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os

PDF_file = "file2.pdf"

pages = convert_from_path(PDF_file, 500)

image_counter = 1

for page in pages:
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

filelimit = image_counter-1

outfile = "out_text.txt"

f = open(outfile, "a")

for i in range(1, filelimit + 1):
    text = str(((pytesseract.image_to_string(Image.open(filename)))))
    text = text.replace('-\n', '')
    f.write(text)

f.close()

Solution

The problem is in the filename declaration.

When the first loop finishes:

for page in pages: 
    filename = "page_"+str(image_counter)+".jpg"
    page.save(filename, 'JPEG') 
    image_counter = image_counter + 1

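After this loop, your filename variable still holds the name built from the final value of image_counter, that is, the last page. The second loop never updates it, so Image.open(filename) opens the last image on every one of its filelimit iterations.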
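The effect can be reproduced without any PDF or OCR library. The snippet below is only a sketch: the three strings in pages are hypothetical stand-ins for the page images, and buggy_reads is an illustrative name for the list of files the second loop would end up opening.

```python
# Hypothetical stand-ins for the images returned by convert_from_path.
pages = ["first page", "second page", "third page"]

image_counter = 1
for page in pages:
    # filename is overwritten on every pass of the loop.
    filename = "page_" + str(image_counter) + ".jpg"
    image_counter = image_counter + 1

filelimit = image_counter - 1

# Same shape as the second loop in the question: i changes on every pass,
# but filename does not, so the same file is "opened" each time.
buggy_reads = []
for i in range(1, filelimit + 1):
    buggy_reads.append(filename)

print(buggy_reads)  # ['page_3.jpg', 'page_3.jpg', 'page_3.jpg']
```

A Python for loop does not create a new scope, so filename simply keeps whatever was assigned to it last.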

One solution is to reassign filename inside the second loop, building it from the loop index i.

for i in range(1, filelimit + 1): 
    filename = "page_"+str(i)+".jpg"
    text = str(((pytesseract.image_to_string(Image.open(filename))))) 
    text = text.replace('-\n', '')     
    f.write(text) 
  
f.close()

That should solve the problem, because each iteration now opens the image for its own page.
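If the intermediate JPEG files are not needed for anything else, a possible variant is to run the OCR on each page inside a single loop, so there is no filename to go stale. This is a rough sketch rather than a tested replacement: ocr_pages and its parameters are names made up for this example, and the OCR function is passed in as an argument so the loop can be tried with fake data first.

```python
import io


def ocr_pages(pages, ocr, out):
    """Run ocr on every page, write the text to out, return the page count.

    pages: iterable of page objects (PIL images in the real use case).
    ocr:   callable that takes one page and returns its text.
    out:   any object with a write() method, such as an open text file.
    """
    count = 0
    for number, page in enumerate(pages, start=1):
        text = str(ocr(page))
        # Same clean-up as in the question: drop a hyphen followed by a newline.
        text = text.replace("-\n", "")
        out.write(text)
        count = number
    return count


# Dry run with fake "pages" and an identity function instead of real OCR.
fake_pages = ["exam-\nple ", "text"]
buffer = io.StringIO()
page_count = ocr_pages(fake_pages, lambda page: page, buffer)
print(page_count, repr(buffer.getvalue()))  # 2 'example text'

# With the real libraries (assuming pdf2image and pytesseract are installed),
# the call would look roughly like this:
#   from pdf2image import convert_from_path
#   import pytesseract
#   with open("out_text.txt", "a") as f:
#       ocr_pages(convert_from_path("file2.pdf", 500),
#                 pytesseract.image_to_string, f)
```

Because enumerate provides the page number alongside each page, the same loop could also record which page a piece of text came from.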

Answered By - Ahx

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
