Issue
I want to extract text from a given PDF.
The code used is:
from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        number_of_pages = pdf.getNumPages()
        for pages in range(number_of_pages):
            page = pdf.getPage(pages)
            page_content = page.extractText()
            print(page_content)

if __name__ == '__main__':
    path = 'test.pdf'
    extract_information(path)
But when I run the above code, I get the following output:
PS E:\Omkar\Coding\Python\pdfSearch> python .\scrape.py
!"#$%&!'()*+&,$ !")-!+)-. !"#$%$&'$%%()%*)(+(+$,-,.-+/ 0 1234#5$&3-6#3#1!4#5$78-$0#5"#3$9:;;#<$=-$%(+,(>(?/0&1(+$2(3)-4!+&)(@15#123"$ A8B-C9D;E:F0G$;@HFI%*,JJ>*%J/H F=-D2K#3B#=->.J*EKK4=- 1#L#342L#$M!152!K$M!1#$M&1NO?JP%%$D9QQ9;IR$SDTC$*E
;FM:0@HC$:FDDG$HU$%%/%?
V>%?W*%JPJ?++ A&3#=%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9D%(+,(>(?X:ED@@G$0FM:E9DQ!Y=V?,,W>J/P/*,/H!Z#-X:ED@@G$0FM:E9DR#Y-$0C@S-$+*)%+)%..* A&3#-$*/>,,J(?*>F3$M!1#$@'-X:ED@@G$0FM:E9D$
E551#BB-(*?$M9CE;:[;RI$ET9$%S
!42#34$FC-$,.>>J>?C2!"$M&5#B-M&N8$;#N&14\(+O?(?\>%O.
C!4#$M&]]#K4#5-I2Z#$M&]]#K4#5-
Q!B423"-$^_I2Z#5$[123#$M&]]#K42&3-$^$$$_
H&3$'!B423"-$^`$_T&]aZ#-
M!]]$;#Ba]4B-$^$$$_M&ZZ#34B- !42#34-F3Ba1!3K#-M]2#34-0#52K!25-0#52K!1#-;!2]1&!5$0M;-
F3Ba1#5$H!Z#-F3Ba1!3K#$ ]!3-9ZN]&8#1)61&aN$H!Z#- &]2K8=-61&aN) ]!3=-%()%+)(+%%-+>$!Z
`X:ED@@G$0FM:E9D$
;#]!42&3BA2N-R#]'bXJJ>(,H$$$5!+&1(+$2(3)-4!+&)(2(*6-!(,1$2(3)-4!+&)(%&!'()*&*$/)71*891,&41($2(3)-4!+&)(;VRRIW6US6;UDSMVS]&&5$Ma]4W
:-71-17$;1*+*
M9CE;:[;RI$HU$%%,%J
09I;@ D[R$0MC$^/>>%(_$ O@O$S@`$%.JJ$H9c$U@;X$HU$
%+%%J%.JJ
OM@TFC%.$RE;RPM@T($$`$$(+(/ A8O$H!Z#-C9D;E:F0G$;@HFI A8B2K2!3$$R2"3!4a1#-
I think this is related to the encoding used in the PDF, but I cannot work out what is going wrong.
Thank you in advance.
Solution
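To extract text from this PDF you need to use OCR. In my opinion the best OCR engine is Tesseract OCR (originally developed at HP, and later developed with Google's sponsorship). You can install pytesseract and run it on images of your PDF pages, since it works on images rather than on the PDF file itself. I also highly recommend using OpenCV first, so that OCR runs only on the regions that contain text.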
Answered By - Luis Bote
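The OCR approach suggested above could be sketched roughly as follows. This is not the answerer's code, only a minimal illustration under some assumptions: the third-party packages pdf2image and pytesseract are installed, and the Poppler and Tesseract programs they rely on are available on the system. The function name ocr_pdf and its render/recognize parameters are hypothetical, introduced here so that each step can be swapped out (for example, to add OpenCV preprocessing before recognition).

```python
def ocr_pdf(pdf_path, dpi=300, render=None, recognize=None):
    """Return a list with the OCR text of each page of a PDF.

    render:    callable taking a path and returning a list of page images
               (assumption: defaults to pdf2image.convert_from_path).
    recognize: callable taking one image and returning its text
               (assumption: defaults to pytesseract.image_to_string).
    """
    if render is None:
        # Imported lazily so the function can be used with custom callables
        # even when pdf2image is not installed.
        from pdf2image import convert_from_path

        def render(path):
            # Rasterise every page of the PDF to a PIL image.
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        import pytesseract

        recognize = pytesseract.image_to_string

    page_images = render(pdf_path)
    # Run OCR page by page, keeping the page order.
    return [recognize(image) for image in page_images]
```

Calling ocr_pdf('test.pdf') and printing each element of the result would then replace the extractText() loop from the question. A higher dpi generally gives Tesseract more detail to work with, at the cost of slower rendering.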