Issue
Good day,
I'm trying to download PDF files from a specific website using requests and BeautifulSoup. I used a script I found online (below), which successfully downloads PDFs from example websites, so the script itself works. But when I run it against this specific website, it completes without downloading any files. The URL is in the script below.
Would anyone be able to help?
# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get a response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0

# From all links, check for a PDF link and
# if present, download the file
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Downloading file: ", i)

        # Get a response object for the link
        response = requests.get(link.get('href'))

        # Write the content to a PDF file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
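When a scraper like this finishes without downloading anything, the usual cause is that the filter condition never matches any href. A quick sanity check is to apply the same condition to the hrefs and inspect what survives. A minimal stand-in sketch (the href list below is hypothetical, not taken from the real page):

```python
# Hypothetical hrefs standing in for the results of soup.find_all('a')
sample_hrefs = [
    "/ResourceCentre/provider-list.ashx?la=en",
    "/Healthcare-Providers/overview",
    None,  # an <a> tag with no href attribute
]

# Apply the same filter the script uses and see what passes
matches = [h for h in sample_hrefs if h is not None and '.pdf' in h]
print(matches)  # an empty list means the condition never fires
```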
Solution
On this site, the links that point to PDF files end with '.ashx?la=en' rather than '.pdf', so you need to match links on that condition instead. The hrefs are also relative (they do not include the domain), so you need to prepend the domain name to each link you find. Here is the working code:
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get a response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0
base_url = "https://www.gems.gov.za"

# From all links, check for a PDF link and
# if present, download the file
for link in links:
    href = link.get("href")
    try:
        if href.split(".")[-1] == 'ashx?la=en':
            i += 1
            print("Downloading file: ", i)
            response = requests.get(f"{base_url}{href}")

            # Write the content to a PDF file
            with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
                pdf.write(response.content)
            print("File ", i, " downloaded")
    except AttributeError:
        # Skip <a> tags that have no href attribute (href is None)
        pass
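As a side note, concatenating `base_url` with the href only works while every href on the page is relative. The standard-library helper `urllib.parse.urljoin` handles both relative and absolute hrefs correctly, and a direct suffix check avoids the `try/except` around `split`. A small sketch of that variation (the example hrefs are hypothetical):

```python
from urllib.parse import urljoin

base_url = "https://www.gems.gov.za"

# urljoin resolves relative hrefs against the base and leaves
# absolute ones untouched, so either form yields a valid URL
relative = "/ResourceCentre/doc.ashx?la=en"
absolute = "https://www.gems.gov.za/other/doc.ashx?la=en"
print(urljoin(base_url, relative))   # base prepended
print(urljoin(base_url, absolute))   # unchanged

# A suffix check that also guards against missing hrefs,
# replacing the split(".")[-1] comparison and the try/except
def is_pdf_link(href):
    return href is not None and href.endswith(".ashx?la=en")

print(is_pdf_link(relative))  # True
print(is_pdf_link(None))      # False
```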
Answered By - user510170