Issue
I want to scrape the information in this PDF with Python. I'm not sure where to start because the content doesn't look structured at all. I'm used to scraping HTML; I tried converting the PDF to HTML, but that didn't really help.
How would you try to scrape this PDF? Here is a link to the PDFs (any will work, they're all similar): https://portal.charitycommissioner.je/Public-Register/ https://www.gov.im/media/1371147/publicindex_latest-15121-v2.pdf
Thank you for any help :D
Solution
It is organized: the data is laid out as a table, and pdfplumber works well for this.
Once you have table settings that correctly match your data, you can call .extract_table() on the page:
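import pdfplumber
import pandas as pd

# "text" strategy: infer column edges from how the words line up, not from ruled lines.
# keep_blank_chars is the setting name used in this answer; newer pdfplumber
# releases may expect text_keep_blank_chars instead.
table_settings = dict(vertical_strategy="text", keep_blank_chars=True)

with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    table = page.extract_table(table_settings)

# table is a list of rows, each row a list of cell strings
df = pd.DataFrame(table)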
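The extracted table is a plain list of rows, so the header row ends up as ordinary data and empty cells can come back as None. A minimal sketch of one way to tidy that up is below. It assumes the first row holds the column headers, which may not be true for every page. The helper name table_to_records and the sample rows are hypothetical, made up for illustration, and are not part of pdfplumber.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: Optional[List[Row]]) -> List[Dict[str, str]]:
    """Turn a list-of-rows table into dicts keyed by the first row's cells.

    Assumption: the first row is the header row.
    """
    if not table:
        # Covers both an empty table and None (no table found on the page).
        return []

    def clean(cell: Optional[str]) -> str:
        # Treat None as empty, and collapse line breaks and padding.
        return " ".join((cell or "").split())

    # Fall back to a positional name when a header cell is blank.
    header = [clean(c) or f"col_{i}" for i, c in enumerate(table[0])]

    records = []
    for row in table[1:]:
        cells = [clean(c) for c in row]
        if not any(cells):
            continue  # skip rows that are entirely blank
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample in the same shape as an extracted table.
sample = [
    ["Name", "Reg.\nNo", None],
    ["Example Trust ", "123", "x"],
    [None, "", "  "],
]
records = table_to_records(sample)
```

With a real page this could be called as table_to_records(page.extract_table(table_settings)), and pd.DataFrame(records) would then give named columns. To cover a whole document, the same call can be repeated for each page in pdf.pages and the resulting lists concatenated.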
Answered By - user15398259