Issue
I've been trying to scrape a website for its Excel files. I'm planning on doing this once for the bulk of the data it contains in its data archives section. I've been able to download individual files one at a time with urllib requests, and I tried it manually on several different files. But when I try to create a function to download all of them, I've been receiving some errors. The first error occurred while just getting the HTTP file addresses as a list. I set verify to False (not the best practice for security reasons) to work around the SSL certificate error it was giving me, and it worked. I then went further and attempted to scrape the files and download them to a specific folder. I've done this before with a similar project and didn't have nearly this hard a time with SSL certificate errors.
import requests
from bs4 import BeautifulSoup
import os

os.chdir(r'C:\The output path where it will go')

url = 'https://pages.stern.nyu.edu/~adamodar/pc/archives/'
reqs = requests.get(url, verify=False)
soup = BeautifulSoup(reqs.text, 'html.parser')

file_type = '.xls'
urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link)
            file.write(response.content)
This is the error it has been giving me, even after setting verify to False, which seemed to solve the problem when generating the list. It grabs the first file each time I try, but it doesn't loop on to the next one.
requests.exceptions.SSLError: HTTPSConnectionPool(host='pages.stern.nyu.edu', port=443): Max retries exceeded with url: /~adamodar/pc/archives/BAMLEMPBPUBSICRPIEY19.xls (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')
What am I missing? I thought I fixed the verification issue.
Solution
You forgot to set verify=False on the second request, the one that actually downloads each file:
urls = []
for link in soup.find_all('a'):
    file_link = link.get('href')
    if file_type in file_link:
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(url + file_link, verify=False)  # <-- This is where you forgot
            file.write(response.content)
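A tidier alternative is to set verify=False once on a requests.Session, so every request made through that session skips certificate checks and you can't forget the flag on an individual call. The sketch below is an assumption-laden rewrite of the loop above (the function name, output-directory parameter, and the None-check on href are my additions, not part of the original code); it also silences the InsecureRequestWarning that requests emits for each unverified request.

```python
import os
import requests
import urllib3
from bs4 import BeautifulSoup

# Disabling verification triggers an InsecureRequestWarning per request;
# suppress it once here rather than letting it flood the console.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def download_files(base_url, out_dir, file_type='.xls'):
    """Download every linked file whose href contains file_type.

    Hypothetical helper for illustration -- not from the original post.
    """
    session = requests.Session()
    session.verify = False  # applies to every request made on this session
    os.makedirs(out_dir, exist_ok=True)

    reqs = session.get(base_url)
    soup = BeautifulSoup(reqs.text, 'html.parser')
    for link in soup.find_all('a'):
        file_link = link.get('href')
        # Guard against <a> tags with no href, which would make the
        # membership test below raise a TypeError.
        if file_link and file_type in file_link:
            response = session.get(base_url + file_link)
            response.raise_for_status()  # fail loudly on a bad download
            with open(os.path.join(out_dir, file_link), 'wb') as f:
                f.write(response.content)
```

Called as, e.g., download_files('https://pages.stern.nyu.edu/~adamodar/pc/archives/', 'archives'), this mirrors the loop above while keeping the verification setting in one place.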
Answered By - Minh Dao