Issue
The following webpage opens fine in my browser:
https://www.sec.gov/files/company_tickers_exchange.json
I added a browser user agent when fetching the webpage with urllib:
from urllib.request import Request, urlopen
url = "https://www.sec.gov/files/company_tickers_exchange.json"
req = Request(
    url=url,
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
It runs into an error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
However, I can get the webpage with Playwright:
from playwright.sync_api import sync_playwright as playwright
pw = playwright().start()
browser = pw.chromium.launch(headless=False)
context = browser.new_context()
page = context.new_page()
url = "https://www.sec.gov/files/company_tickers_exchange.json"
page.goto(url)
page.content()
I find this method clunky. How can I get the webpage with urllib alone?
Solution
Judging by the Fair Access section of SEC.gov | Accessing EDGAR Data, passing a normal browser header from a non-browser client (as you've tried to do) will likely be met with a negative response:
Please declare your user agent in request headers:
Sample Declared Bot Request Headers:
[Header]          [Value]
User-Agent:       Sample Company Name [email protected]
Accept-Encoding:  gzip, deflate
Host:             www.sec.gov
Heeding this advice seems to work in my test on Repl.it:
from urllib.request import Request, urlopen
url = "https://www.sec.gov/files/company_tickers_exchange.json"
req = Request(
    url=url,
    headers={'User-Agent': 'Sean Quinn [email protected]'}
)
webpage = urlopen(req).read()
print(webpage)
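The sample headers quoted above also list Accept-Encoding: gzip, deflate. urllib does not decompress responses on its own, so if you send that header you have to decode the body yourself. Below is a minimal sketch of how that might look, not a definitive implementation. The helper names (build_request, decode_body, fetch_json) and the placeholder identity in DECLARED_USER_AGENT are my own assumptions for illustration; substitute your real name and contact email.

```python
import gzip
import json
import zlib
from urllib.request import Request, urlopen

URL = "https://www.sec.gov/files/company_tickers_exchange.json"

# Placeholder identity (an assumption): replace with your own name and email.
DECLARED_USER_AGENT = "Sample Company Name admin@example.com"


def build_request(url, user_agent=DECLARED_USER_AGENT):
    """Build a Request that declares who we are, following the SEC sample headers."""
    return Request(
        url=url,
        headers={
            "User-Agent": user_agent,
            "Accept-Encoding": "gzip, deflate",
            # No explicit Host header: urllib derives it from the URL.
        },
    )


def decode_body(raw, content_encoding):
    """Undo the response's Content-Encoding, since urllib leaves the body as-is."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form of deflate.
        return zlib.decompress(raw)
    return raw


def fetch_json(url, user_agent=DECLARED_USER_AGENT):
    """Fetch the URL with declared headers and parse the body as JSON."""
    with urlopen(build_request(url, user_agent)) as response:
        raw = response.read()
        body = decode_body(raw, response.headers.get("Content-Encoding"))
    return json.loads(body)
```

Calling fetch_json(URL) performs the actual request. Nothing in the block touches the network until you call it, so the helpers can be checked offline first.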
Answered By - esqew