Issue
I'm writing a script that scrapes a page, detects links to PDF files, and collects the results in a DataFrame. Right now the code only detects links that end with the ".pdf" extension. How can I make it detect PDF files behind links that have no extension? My code is:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")
    results = []
    for link in links:
        href = link.get("href")
        if href is not None:
            file_url = url + href
            file_response = requests.head(file_url)
            content_type = file_response.headers.get("Content-Type")
            is_pdf = content_type == 'application/pdf' or href.lower().endswith('.pdf')
            status = file_response.status_code
            if status == 404:  # Check whether the status is 404 (Not Found)
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
            else:
                results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
    df = pd.DataFrame(results)
    df
else:
    print("Fail", response.status_code)
The code runs correctly, but I want to improve it.
Solution
With file_url = url + href you're constructing a URL that doesn't exist on the server. Try to parse only the links that contain "download" and prepend only the domain (base_url) to each relative href:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
base_url = 'https://machado.mec.gov.br'

response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    # Keep only anchors whose href contains "download"
    links = soup.select("a[href*=download]")
    results = []
    for link in links:
        href = link["href"]
        # Absolute links are used as-is; relative ones get the domain prepended
        if href.startswith('http'):
            file_url = href
        else:
            file_url = base_url + href
        # A HEAD request fetches the headers without downloading the file
        file_response = requests.head(file_url)
        content_type = file_response.headers.get("Content-Type")
        is_pdf = content_type == 'application/pdf' or href.lower().endswith('.pdf')
        status = file_response.status_code
        # Record every link, whatever its status (404 included)
        results.append({"Link": file_url, "Status": status, "Arquivo": href, "PDF": is_pdf})
    df = pd.DataFrame(results)
    print(df)
else:
    print("Fail", response.status_code)
Prints:
   Link                                                                                              Status  Arquivo                                                                 PDF
0  https://machado.mec.gov.br/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8  200     /obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8  True
1  https://machado.mec.gov.br/obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f  200     /obra-completa-lista/item/download/30_8e623caa384980ca20f48a66e691074f  True
2  https://machado.mec.gov.br/obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e  200     /obra-completa-lista/item/download/29_008edfdf58623bb13d27157722a7281e  True
3  https://machado.mec.gov.br/obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131  200     /obra-completa-lista/item/download/28_b10fd1f9a75bcaa4573e55e677660131  True
4  https://machado.mec.gov.br/obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a  200     /obra-completa-lista/item/download/26_29eaa69154e158508ef8374fcb50937a  True
5  https://machado.mec.gov.br/obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884  200     /obra-completa-lista/item/download/25_fcddef9a9bd325ad2003c64f4f4eb884  True
6  https://machado.mec.gov.br/obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985  200     /obra-completa-lista/item/download/24_938f74988ddbf449047ecc5c5b575985  True
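To see why url + href fails while the domain-only prefix works, here is a small sketch using the standard library's urllib.parse.urljoin, which resolves an href against the page URL the way a browser would. This is an alternative to the manual startswith('http') check above, not what the answer's code does; the variable names page_url, naive and resolved are only for illustration.

```python
from urllib.parse import urljoin

page_url = "https://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto"
href = "/obra-completa-lista/item/download/31_15b64419a44a2b6ba9781ae001275ae8"

# Plain concatenation glues the href onto the full page path,
# giving a URL that does not exist on the server.
naive = page_url + href

# urljoin sees that href starts with "/" and replaces the whole path,
# keeping only the scheme and domain of page_url.
resolved = urljoin(page_url, href)

print(naive)
print(resolved)

# An href that is already absolute is returned unchanged.
print(urljoin(page_url, "https://example.com/file.pdf"))
```

With urljoin, the if/else that builds file_url could be replaced by a single call, and relative hrefs without a leading slash would also be handled.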
Answered By - Andrej Kesely
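On the original question of detecting PDFs without a ".pdf" extension: the check content_type == 'application/pdf' can miss a server that sends parameters after the media type (for example "application/pdf; charset=binary") or sends no Content-Type at all. Below is a minimal sketch of a more tolerant check. The helper name looks_like_pdf and its first_bytes parameter are hypothetical, not part of the answer above; the fallback relies on the fact that PDF files begin with the signature "%PDF-".

```python
def looks_like_pdf(content_type, first_bytes=b""):
    """Return True if the header or the leading bytes indicate a PDF.

    content_type: value of the Content-Type header, or None if absent.
    first_bytes:  optional first few bytes of the response body.
    """
    # Drop any parameters such as "; charset=binary" and normalise case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back to the file signature: PDF files start with "%PDF-".
    return first_bytes.startswith(b"%PDF-")


print(looks_like_pdf("application/pdf; charset=binary"))
print(looks_like_pdf("text/html; charset=utf-8"))
print(looks_like_pdf(None, b"%PDF-1.7\n"))
```

In the loop above, is_pdf could then be computed as looks_like_pdf(file_response.headers.get("Content-Type")). If the header proves unreliable, the first bytes could be fetched with requests.get(file_url, stream=True) and next(response.iter_content(chunk_size=8), b""), at the cost of a GET request per link instead of a HEAD request.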