Issue
I need to extract all the URLs from the Elements panel, which you can see by right-clicking a page in Chrome and choosing Inspect.
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD'
The URL shown on the right appears when you inspect the zip link on the left in the following image:
I am trying the following, but both zip_urls1 and zip_urls2 come back empty:
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD'

session = HTMLSession()
resp = session.get(url)
resp.html.render()  # render the page in Chromium so the JavaScript runs

# Attempt 1: collect the title attribute of every table cell
cells = BeautifulSoup(resp.html.html, "lxml").find_all("td")
zip_urls1 = [td.get('title') for td in cells if td.get('title') is not None]

# Attempt 2: collect every href that points at a document download
# (guard against None, since many anchors have no href at all)
anchors = BeautifulSoup(resp.html.html, "lxml").find_all("a")
zip_urls2 = [a.get('href') for a in anchors if a.get('href') and 'doclookupId' in a.get('href')]
Solution
That ERCOT site is a nightmare to access, what with all the IP blocking, failing requests, and even certificate pinning...
Having said all that, the URL you gave is just a presentation layer. The data comes from an entirely different endpoint.
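You can confirm that for yourself: fetched statically, the page contains none of the download links. Here's a minimal sketch (plain requests, no JavaScript rendering; depending on your IP, the site may refuse the request outright):

import requests

# Fetch the product-details page exactly as the question's script sees it,
# before any JavaScript has run.
url = "https://www.ercot.com/mp/data-products/data-product-details?id=NP6-788-CD"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# The table rows are injected later by an XHR call, so the static markup
# never mentions doclookupId.
print("doclookupId" in html)  # False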
That said, my previous answer to a similar question of yours still works; you just need to add this:
import time

type_id = "12300"
endpoint = f"https://www.ercot.com/misapp/servlets/" \
           f"IceDocListJsonWS?reportTypeId={type_id}&_={int(time.time())}"
The full script (further down) works like a charm and produces this:
Downloading LMPSROSNODENP6788_20230514_000017_csv...
Downloading LMPSROSNODENP6788_20230514_000017_xml...
Process finished with exit code 0
with the final result of:
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.130517630.LMPSROSNODENP6788_20230514_130512_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230515.065520303.LMPSROSNODENP6788_20230515_065516_csv.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230516.223532628.LMPSROSNODENP6788_20230516_223521_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.170019466.LMPSROSNODENP6788_20230514_170015_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.160517385.LMPSROSNODENP6788_20230514_160511_xml.zip
4.00 KiB ├─ cdr.00012300.0000000000000000.20230514.025517229.LMPSROSNODENP6788_20230514_025512_csv.zip
11.46 MiB zip_files
2934 files
Here's the ENTIRE dump of all the 2934 .csv and .xml files from the URL you gave (it's on my OneDrive):
However, if you feel like running the code yourself, here's an updated version:
import os
import time
from pathlib import Path
from shutil import copyfileobj
import requests
type_id = "12300"
endpoint = f"https://www.ercot.com/misapp/servlets/" \
           f"IceDocListJsonWS?reportTypeId={type_id}&_={int(time.time())}"

headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, utf-8",
    "Host": "www.ercot.com",
    "Referer": "https://www.ercot.com/mp/data-products/data-product-details?id=NP7-802-M",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:108.0) Gecko/20100101 Firefox/108.0",
    "X-Requested-With": "XMLHttpRequest"
}
os.makedirs("zip_files", exist_ok=True)
download_url = "https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId="
with requests.Session() as s:
    auction_results = s.get(endpoint, headers=headers).json()
    for result in auction_results["ListDocsByRptTypeRes"]["DocumentList"]:
        file_name = result["Document"]["ConstructedName"]
        # doclookupId takes the document's own ID (DocID), not the report type ID
        zip_url = f"{download_url}{result['Document']['DocID']}"
        print(f"Downloading {result['Document']['FriendlyName']}...")
        r = s.get(zip_url, headers=headers, stream=True)
        with open(Path("zip_files") / file_name, 'wb') as f:
            copyfileobj(r.raw, f)
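Each downloaded archive wraps a single CSV or XML file of the same name. If you want to peek inside one without unpacking everything, here's a minimal sketch using only the standard library (it just grabs whichever _csv.zip it finds first):

import csv
import io
import zipfile
from pathlib import Path

# Pick any of the downloaded archives; each *_csv.zip holds one CSV member.
zip_path = next(Path("zip_files").glob("*_csv.zip"))

with zipfile.ZipFile(zip_path) as zf:
    member = zf.namelist()[0]
    with zf.open(member) as fh:
        reader = csv.reader(io.TextIOWrapper(fh, encoding="utf-8"))
        print(member, next(reader))  # file name and header row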
Final note: the code performs flawlessly on a VPN connection with a server in Houston, Texas.
Answered By - baduker