Issue
I need to extract all the URLs from the Elements panel, which you can see by right-clicking a page in Chrome and choosing Inspect.
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD'
The URL shown on the right appears when you inspect the zip link on the left in the following image:
I am trying the following, but both zip_urls1 and zip_urls2 come back empty:
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD'

session = HTMLSession()
resp = session.get(url)
resp.html.render()  # render the page in Chromium so the JavaScript runs

# Attempt 1: collect the title attribute of every table cell
cells = BeautifulSoup(resp.html.html, "lxml").find_all("td")
zip_urls1 = [td.get('title') for td in cells if td.get('title') is not None]

# Attempt 2: collect every href that points at a document download
# (guard against None, since many anchors have no href at all)
anchors = BeautifulSoup(resp.html.html, "lxml").find_all("a")
zip_urls2 = [a.get('href') for a in anchors if a.get('href') and 'doclookupId' in a.get('href')]
Solution
That ERCOT site is a nightmare to access, what with all the IP blocking, failing requests, and even certificate pinning...
Having said all that, the URL you gave is just a presentation layer. The data comes from an entirely different endpoint.
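You can confirm that for yourself: fetched statically, the page contains none of the download links. Here's a minimal sketch (plain requests, no JavaScript rendering; depending on your IP, the site may refuse the request outright):

import requests

# Fetch the product-details page exactly as the question's script sees it,
# before any JavaScript has run.
url = "https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# The table rows are injected later by an XHR call, so the static markup
# never mentions doclookupId.
print("doclookupId" in html)  # False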
That said, my previous answer to a similar question of yours still works; you just need to add this:
import time

type_id = "12300"
endpoint = f"https://www.ercot.com/misapp/servlets/" \
           f"IceDocListJsonWS?reportTypeId={type_id}&_={int(time.time())}"
The full script (further down) works like a charm and produces this:
Downloading LMPSROSNODENP6788_20230514_000017_csv...
Downloading LMPSROSNODENP6788_20230514_000017_xml...
Process finished with exit code 0
with the final result of:
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.130517630.LMPSROSNODENP6788_20230514_130512_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230515.065520303.LMPSROSNODENP6788_20230515_065516_csv.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230516.223532628.LMPSROSNODENP6788_20230516_223521_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.170019466.LMPSROSNODENP6788_20230514_170015_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.160517385.LMPSROSNODENP6788_20230514_160511_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.025517229.LMPSROSNODENP6788_20230514_025512_csv.zip
11.46 MiB zip_files
2934 files
Here's the ENTIRE dump of all the 2934 .csv and .xml files from the URL you gave (it's on my OneDrive):
However, if you feel like running the code yourself, here's an updated version:
import os
import time
from pathlib import Path
from shutil import copyfileobj
import requests
type_id = "12300"
endpoint = f"https://www.ercot.com/misapp/servlets/" \
           f"IceDocListJsonWS?reportTypeId={type_id}&_={int(time.time())}"

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, utf-8",
    "Host": "www.ercot.com",
    "Referer": "https://www.ercot.com/mp/data-products/data-product-details?id=NP7-802-M",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:108.0) Gecko/20100101 Firefox/108.0",
    "X-Requested-With": "XMLHttpRequest"
}
os.makedirs("zip_files", exist_ok=True)
download_url = "https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId="
with requests.Session() as s:
    auction_results = s.get(endpoint, headers=headers).json()
    for result in auction_results["ListDocsByRptTypeRes"]["DocumentList"]:
        file_name = result["Document"]["ConstructedName"]
        # doclookupId takes the document's own ID (DocID), not the report type ID
        zip_url = f"{download_url}{result['Document']['DocID']}"
        print(f"Downloading {result['Document']['FriendlyName']}...")
        r = s.get(zip_url, headers=headers, stream=True)
        with open(Path("zip_files") / file_name, 'wb') as f:
            copyfileobj(r.raw, f)
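Each downloaded archive wraps a single CSV or XML file of the same name. If you want to peek inside one without unpacking everything, here's a minimal sketch using only the standard library (it just grabs whichever _csv.zip it finds first):

import csv
import io
import zipfile
from pathlib import Path

# Pick any of the downloaded archives; each *_csv.zip holds one CSV member.
zip_path = next(Path("zip_files").glob("*_csv.zip"))

with zipfile.ZipFile(zip_path) as zf:
    member = zf.namelist()[0]
    with zf.open(member) as fh:
        reader = csv.reader(io.TextIOWrapper(fh, encoding="utf-8"))
        print(member, next(reader))  # file name and header row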
Final note: the code performs flawlessly on a VPN connection with a server in Houston, Texas.
Answered By - baduker