Wednesday, November 15, 2023

[FIXED] Scraping PDF files in Python with Requests and BeautifulSoup

Labels: beautifulsoup, python, python-requests, web-scraping

Issue

I'm trying to write code that scrapes a page, detects the links that point to PDF files, and builds a DataFrame from that information. At the moment the code only detects links that end with the ".pdf" extension. How can I make it detect PDF files behind links that don't have the extension? My code is this:

import requests
from bs4 import BeautifulSoup
import pandas as pd


url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"


response = requests.get(url)


if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")
    results = []
    for link in links:
        href = link.get("href")
        if href is not None:
            file_url = url + href  
            file_response = requests.head(file_url) 
            content_type = file_response.headers.get("Content-Type")
            is_pdf = content_type == 'application/pdf' or href.lower().endswith('.pdf')

            status = file_response.status_code
            if status == 404:  # Check whether the status is 404 (Not Found)
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
            else:
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})

    
    df = pd.DataFrame(results)
    df

else:
    print("Fail", response.status_code)

The code runs without errors, but I want to improve it.

Solution

With file_url = url + href you're constructing a URL that doesn't exist on the server, because each href is appended to the full page URL rather than to the site root. Instead, select only the links whose href contains download and prepend just the domain (base_url):

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
base_url = 'https://machado.mec.gov.br'

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.select("a[href*=download]")
    results = []
    for link in links:
        href = link["href"]

        if href.startswith('http'):
            file_url = href
        else:
            file_url = base_url + href

        file_response = requests.head(file_url)
        content_type = file_response.headers.get("Content-Type")
        is_pdf = content_type == 'application/pdf' or href.lower().endswith('.pdf')
        status = file_response.status_code
        # Record every link, whatever its status (404 Not Found included)
        results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})

    df = pd.DataFrame(results)
    print(df)

else:
    print("Fail", response.status_code)

Prints:

                                                                                               Link  Status                                                                 Arquivo   PDF
0  https://machado.mec.gov.br/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8     200  /obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8  True
1  https://machado.mec.gov.br/obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f     200  /obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f  True
2  https://machado.mec.gov.br/obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e     200  /obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e  True
3  https://machado.mec.gov.br/obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131     200  /obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131  True
4  https://machado.mec.gov.br/obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a     200  /obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a  True
5  https://machado.mec.gov.br/obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884     200  /obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884  True
6  https://machado.mec.gov.br/obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985     200  /obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985  True
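The PDF column above relies on an exact match of the Content-Type header, which can miss a value that carries parameters or different casing. A more tolerant check might look like the sketch below. The helper name is_pdf_response and its first_bytes argument are hypothetical and not part of the answer; the sketch assumes you may optionally pass in the first few bytes of the body, since PDF files normally begin with the signature %PDF-.

```python
from urllib.parse import urlsplit


def is_pdf_response(content_type, href, first_bytes=b""):
    """Best-effort PDF check (hypothetical helper, not from the answer).

    content_type: value of the Content-Type header, or None if absent.
    href:         the link as found in the page.
    first_bytes:  optional leading bytes of the response body.
    """
    # The header may carry parameters, e.g. "application/pdf; charset=binary",
    # so compare only the media type, case-insensitively.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "application/pdf":
        return True

    # PDF files normally start with the "%PDF-" signature.
    if first_bytes.startswith(b"%PDF-"):
        return True

    # Fall back to the extension, ignoring any query string or fragment.
    return urlsplit(href).path.lower().endswith(".pdf")
```

In the loop above this could replace the is_pdf line, for example is_pdf = is_pdf_response(content_type, href). Supplying first_bytes would need a GET request instead of HEAD (for instance with stream=True so the whole file is not downloaded), which is only worth it when the server's headers turn out to be unreliable.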

Answered By - Andrej Kesely
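As a side note to the answer, the manual startswith('http') / base_url + href branching can also be handled by urljoin from the standard library, which resolves a link against the page URL much as a browser would. The short sketch below uses the page URL and one href from the output above; the example.com link is an assumption added only to illustrate the absolute-URL case.

```python
from urllib.parse import urljoin

page_url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
href = "/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8"

# Plain concatenation keeps the page path and appends the link path to it,
# producing a URL that does not exist on the server.
broken = page_url + href

# urljoin replaces the page path because the href starts with "/".
resolved = urljoin(page_url, href)
print(resolved)

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, "https://example.com/file.pdf")
print(absolute)
```

With this, file_url = urljoin(url, href) would cover both branches of the if/else in the solution, and it would also handle relative links that do not start with a slash.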

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
