Issue
Good day,
I'm trying to download PDF files from a specific website using requests and BeautifulSoup. I used a script I found online (below), which successfully downloads PDFs from example websites, so the script itself works. But when I run it against this specific website, it completes without downloading any files. The URL is in the script below.
Would anyone be able to help?
# Import libraries
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get a response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0

# From all links, check for a PDF link and
# if present, download the file
for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Downloading file: ", i)

        # Get a response object for the link
        response = requests.get(link.get('href'))

        # Write the content to a PDF file
        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")
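When a scraper like this finishes without downloading anything, the usual cause is that the filter condition never matches any href. A quick sanity check is to apply the same condition to the hrefs and inspect what survives. A minimal stand-in sketch (the href list below is hypothetical, not taken from the real page):

```python
# Hypothetical hrefs standing in for the results of soup.find_all('a')
sample_hrefs = [
    "/ResourceCentre/provider-list.ashx?la=en",
    "/Healthcare-Providers/overview",
    None,  # an <a> tag with no href attribute
]

# Apply the same filter the script uses and see what passes
matches = [h for h in sample_hrefs if h is not None and '.pdf' in h]
print(matches)  # an empty list means the condition never fires
```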
Solution
On this site, the links that point to PDF files end with '.ashx?la=en' rather than '.pdf', so you need to match links on that condition instead. The hrefs are also relative (they do not include the domain), so you need to prepend the domain name to each link you find. Here is the working code:
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://www.gems.gov.za/Healthcare-Providers/GEMS-Netwrk-of-Healthcare-Providers/Specialist-Network/Obstetricians-and-gynaecologists-list/"

# Request the URL and get a response object
response = requests.get(url)

# Parse the text obtained
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0
base_url = "https://www.gems.gov.za"

# From all links, check for a PDF link and
# if present, download the file
for link in links:
    href = link.get("href")
    try:
        if href.split(".")[-1] == 'ashx?la=en':
            i += 1
            print("Downloading file: ", i)
            response = requests.get(f"{base_url}{href}")

            # Write the content to a PDF file
            with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
                pdf.write(response.content)
            print("File ", i, " downloaded")
    except AttributeError:
        # Skip <a> tags that have no href attribute (href is None)
        pass
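As a side note, concatenating `base_url` with the href only works while every href on the page is relative. The standard-library helper `urllib.parse.urljoin` handles both relative and absolute hrefs correctly, and a direct suffix check avoids the `try/except` around `split`. A small sketch of that variation (the example hrefs are hypothetical):

```python
from urllib.parse import urljoin

base_url = "https://www.gems.gov.za"

# urljoin resolves relative hrefs against the base and leaves
# absolute ones untouched, so either form yields a valid URL
relative = "/ResourceCentre/doc.ashx?la=en"
absolute = "https://www.gems.gov.za/other/doc.ashx?la=en"
print(urljoin(base_url, relative))   # base prepended
print(urljoin(base_url, absolute))   # unchanged

# A suffix check that also guards against missing hrefs,
# replacing the split(".")[-1] comparison and the try/except
def is_pdf_link(href):
    return href is not None and href.endswith(".ashx?la=en")

print(is_pdf_link(relative))  # True
print(is_pdf_link(None))      # False
```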
Answered By - user510170