Issue
I am trying to convert many PDF files to text. The PDF files are organized in subdirectories within a directory, so there are three layers: directory --> subdirectories --> multiple PDF files in each subdirectory. I am using the following code, which gives me this error: ValueError: too many values to unpack (expected 3). The code works when I convert files in a single directory, but not when they are spread across multiple subdirectories.
It might be quite simple but I cannot get my head around it. Any help would be much appreciated. Thanks.
import pytesseract
from pdf2image import convert_from_path
import glob
pdfs = glob.glob(r"K:\pdf_files")
for pdf_path, dirs, files in pdfs:
    for file in files:
        convert_from_path(os.path.join(pdf_path, file), 500)
        for pageNum,imgBlob in enumerate(pages):
            text = pytesseract.image_to_string(imgBlob,lang='eng')
            with open(f'{pdf_path}.txt', 'a') as the_file:
                the_file.write(text)
Solution
I solved the problem in a simpler way by adding * to the glob pattern so that it matches every subdirectory of the directory:
import pytesseract
from pdf2image import convert_from_path
import glob
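# The first * matches each subdirectory, *.pdf matches the PDF files inside it
pdfs = glob.glob(r"K:\pdf_files\*\*.pdf")
for pdf_path in pdfs:
    # Render every page of the PDF as an image at 500 DPI
    pages = convert_from_path(pdf_path, 500)
    for pageNum, imgBlob in enumerate(pages):
        # OCR the page image and append its text to <pdf_path>.txt
        text = pytesseract.image_to_string(imgBlob, lang='eng')
        with open(f'{pdf_path}.txt', 'a') as the_file:
            the_file.write(text)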
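For context on the original error: glob.glob returns a flat list of path strings, so unpacking each item into pdf_path, dirs, files fails. Three-part tuples of that shape come from os.walk instead. The sketch below is one possible alternative for when the PDFs sit at varying depths, not part of the answer above. The helper name find_pdfs is hypothetical, and it only collects paths, leaving the OCR step out.

```python
import os


def find_pdfs(root):
    """Return the sorted paths of every .pdf file under root, at any depth."""
    pdf_paths = []
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory,
    # which is the three-value shape the original loop tried to unpack.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match so .PDF files are picked up as well
            if name.lower().endswith(".pdf"):
                pdf_paths.append(os.path.join(dirpath, name))
    return sorted(pdf_paths)
```

Each returned path could then be passed to convert_from_path(pdf_path, 500) in the same loop as above. The pattern with a single * stays the simpler choice when all PDFs are exactly one level down.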
Answered By - crackers