Friday, January 28, 2022

[FIXED] Extract text from PDF File using Python with PyPDF2

Labels: pdf, pypdf2, python, scrapy, scripting

Issue

I want to extract text from a given PDF.

The code used is:

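# Note: PdfFileReader, getNumPages, getPage and extractText are the
# legacy PyPDF2 names (they were removed in PyPDF2 3.0).
from PyPDF2 import PdfFileReader


def extract_information(pdf_path):
    # Open in binary mode; the reader needs the raw PDF bytes.
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        number_of_pages = pdf.getNumPages()
        # Print the text layer of every page, in order.
        for page_number in range(number_of_pages):
            page = pdf.getPage(page_number)
            page_content = page.extractText()
            print(page_content)


if __name__ == '__main__':
    path = 'test.pdf'
    extract_information(path)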

But when I run the code above, I get the following output:

PS E:\Omkar\Coding\Python\pdfSearch> python .\scrape.py
 !"#$%&!'()*+&,$ !")-!+)-. !"#$%$&'$%%()%*)(+(+$,-,.-+/ 0 1234#5$&3-6#3#1!4#5$78-$0#5"#3$9:;;#<$=-$%(+,(>(?/0&1(+$2(3)-4!+&)(@15#123"$ A8B-C9D;E:F0G$;@HFI%*,JJ>*%J/H F=-D2K#3B#=->.J*EKK4=- 1#L#342L#$M!152!K$M!1#$M&1NO?JP%%$D9QQ9;IR$SDTC$*E
;FM:0@HC$:FDDG$HU$%%/%?
V>%?W*%JPJ?++ A&3#=%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9DQ!Y=V?,,W>J/P/*,/H!Z#-X:ED@@G$0FM:E9DR#Y-$0C@S-$+*)%+)%..* A&3#-$*/>,,J(?*>F3$M!1#$@'-X:ED@@G$0FM:E9D$
E551#BB-(*?$M9CE;:[;RI$ET9$%S
 !42#34$FC-$,.>>J>?C2!"$M&5#B-M&N8$;#N&14\(+O?(?\>%O.
C!4#$M&]]#K4#5-I2Z#$M&]]#K4#5-
Q!B423"-$^_I2Z#5$[123#$M&]]#K42&3-$^$$$_
H&3$'!B423"-$^`$_T&]aZ#-
M!]]$;#Ba]4B-$^$$$_M&ZZ#34B- !42#34-F3Ba1!3K#-M]2#34-0#52K!25-0#52K!1#-;!2]1&!5$0M;-
F3Ba1#5$H!Z#-F3Ba1!3K#$ ]!3-9ZN]&8#1)61&aN$H!Z#- &]2K8=-61&aN) ]!3=-%()%+)(+%%-+>$!Z
`X:ED@@G$0FM:E9D$
;#]!42&3BA2N-R#]'bXJJ>(,H$$$5!+&1(+$2(3)-4!+&)(2(*6-!(,1$2(3)-4!+&)(%&!'()*&*$/)71*891,&41($2(3)-4!+&)(;VRRIW6US6;UDSMVS]&&5$Ma]4W
:-71-17$;1*+*
M9CE;:[;RI$HU$%%,%J
09I;@ D[R$0MC$^/>>%(_$ O@O$S@`$%.JJ$H9c$U@;X$HU$
%+%%J%.JJ
OM@TFC%.$RE;RPM@T($$`$$(+(/ A8O$H!Z#-C9D;E:F0G$;@HFI A8B2K2!3$$R2"3!4a1#-

I think this is related to the encoding used in the PDF, but I can't work out what is going wrong.

link to the pdf used

Thank you in advance.

Solution

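To extract the text from this PDF you need to use OCR. Output like the above usually means the text layer cannot be decoded (for example, the embedded fonts have no usable character mapping), so the reliable route is to read the text from the rendered page instead. In my opinion the best OCR engine is Tesseract, an open-source engine whose development Google sponsored for many years. You can install pytesseract and run it on your PDF; because pytesseract works on images rather than PDF files, each page has to be rendered to an image first. I highly recommend combining it with OpenCV so that OCR runs only on the regions of the page that contain text.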

https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
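As a rough illustration of the approach described above (a sketch under stated assumptions, not the answerer's own code): the first two functions are a simple heuristic for spotting an unusable text layer, and the third renders each page to an image and passes it to Tesseract. It assumes the pdf2image and pytesseract packages plus the Poppler and Tesseract programs are installed. The function names, the 0.3 threshold, and the render/recognize parameters are hypothetical choices made for this example, and the OpenCV region detection is not shown.

```python
def symbol_ratio(text):
    """Fraction of non-whitespace characters that are not letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(1 for ch in visible if not ch.isalnum()) / len(visible)


def looks_garbled(text, threshold=0.3):
    """Heuristic: ordinary prose is mostly letters and digits.

    The 0.3 threshold is an arbitrary assumption, not a tuned value.
    """
    return symbol_ratio(text) > threshold


def ocr_pdf(pdf_path, dpi=300, lang="eng", render=None, recognize=None):
    """Render each page to an image and OCR it; returns one string per page.

    `render` and `recognize` are hypothetical hooks added for this sketch so
    the function can be exercised without the OCR stack installed.
    """
    if render is None:
        # pdf2image needs the Poppler utilities installed on the system.
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        # pytesseract needs the Tesseract executable installed and on PATH.
        import pytesseract
        recognize = pytesseract.image_to_string
    pages = render(pdf_path, dpi=dpi)  # one PIL image per page
    return [recognize(page, lang=lang) for page in pages]
```

With the file from the question, printing each string returned by ocr_pdf('test.pdf') would take the place of the extractText() loop, and looks_garbled(page_content) could be used to decide page by page whether the OCR fallback is needed at all. A higher dpi generally gives Tesseract more to work with, at the cost of slower rendering.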

Answered By - Luis Bote

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
