Issue
I'm running a simple PDF to image conversion using Python PDF2Image library. I can certainly understand that the max memory threshold is being crossed by this library to arrive at this error. But, the PDF is 6.6 MB (approx), then why would it take up GBs of memory to throw a memory error?
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
self.run()
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
buffer.append(fh.read())
MemoryError
Also, what is the possible solution to this?
Update: When I reduced the dpi parameter from the convert_from_path function, it works like a charm. But the pictures produced are low quality (for obvious reasons). Is there a way to fix this? Like batch by batch creation of images and clearing memory everytime. If there is a way, how to go about it?
Solution
Convert the PDF in chunks of 10 pages at a time (1-10, 11-20... and so on)
from pdf2image import pdfinfo_from_path, convert_from_path
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)
maxPages = info["Pages"]
for page in range(1, maxPages+1, 10) :
convert_from_path(pdf_file, dpi=200, first_page=page, last_page = min(page+10-1,maxPages))
Answered By - napuzba
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.