Issue
In Python, I can HTML-scrape the 2023 data that is visible when you go to the website, but since the table is interactive, I believe I cannot scrape earlier data (2022, for example) without using the Selenium library. I am having trouble incorporating Selenium into my working HTML scrape (given below).
Hi all,
I am trying to automate the process of going to the following website (https://sie.energia.gob.mx/bdiController.do?action=cuadro&cvecua=PMXC1C01E) and was wondering if anyone had insight into retrieving historical data from the given table. It automatically displays Jan 2023-May 2023, but you have to set the options at the top to have the data begin at my desired start of Jan 2018. I am having issues with Selenium; I am not good at reading HTML and directing the library where to go. I have also tried using HTTP headers to have the data load automatically, but to no avail. Below is working code that retrieves the 2023 data. I would like to combine it with Selenium so that it auto-selects the date range, and then this code reads the resulting HTML from the webdriver. Please let me know if anyone has follow-up questions. I'm sorry if this wasn't explained well enough, as this is my first time asking a question on Stack Overflow. Thank you.
import pandas as pd
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.select import Select
#setting up
url = "https://sie.energia.gob.mx/bdiController.do?action=cuadro&cvecua=PMXC1C01E"
webdriver_path = 'my_path'
chrome_options = Options()
driver = webdriver.Chrome(service=Service(webdriver_path), options=chrome_options)
#open url
driver.get(url)
#find the "opciones" button and click it
opciones_button = driver.find_element(By.ID, "opciones")
opciones_button.click()
#January is my desired start month, and I want the most updated data, so I do not need to edit any other dropdown options besides start year (ano inicial)
#change the start year for the dynamic js table to 2018 instead of 2023
start_year_select = Select(driver.find_element(By.NAME, "anoini"))
start_year_select.select_by_value("2018")
#note that the rest of the code won't work until the accept button can be clicked and the changes can be applied
#find the "aceptar" button and click it
# aceptar_button = driver.find_element(By.NAME, "Aceptar")
# aceptar_button.click()
#allow data to load
time.sleep(10)
#get the html content with all pertinent historical data
html_content = driver.page_source
#close browser
driver.quit()
#parse the html
soup = BeautifulSoup(html_content, "html.parser")
#convert to pandas dataframe
row = soup.find('td', class_='descripcion bold level-0').parent
cells = row.find_all('td')
df = pd.DataFrame([cell.text.strip() for cell in cells]).transpose()
print(df)
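As an aside on the Selenium route (reusing driver and By from the code above): for the step that is commented out, an explicit wait is usually more reliable than a fixed time.sleep(). The following is only a sketch; it assumes the accept button really is locatable by its name attribute "Aceptar" (taken from the commented-out attempt), which you would need to confirm by inspecting the page.
#sketch: wait for the "aceptar" button to become clickable, then click it
#the (By.NAME, "Aceptar") locator is an assumption carried over from the commented-out attempt above
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 20)
aceptar_button = wait.until(EC.element_to_be_clickable((By.NAME, "Aceptar")))
aceptar_button.click()
#after clicking, wait for the table to refresh before reading driver.page_source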
Solution
You can issue a simplified version of the POST request the page makes when updating its content, specifying your custom date range. There is no need for the overhead of Selenium. A Session is used because the server expects a session cookie.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

headers = {
    "user-agent": "Mozilla/5.0",
}
params = {
    "action": "cuadro",
    "subAction": "applyOptions",
}
data = {
    "datosde": "REALES",
    "periodicidad": "1",
    "mesini": "01",
    "anoini": "2018",
    "mesfin": "05",
    "anofin": "2023",
    "datosdeSelect2": "REALES",
    "anocompararSelect": "2023",
    "unidador": "Mbd",
    "unidadde": "b",
    "variaRespectoRadio": "mismoperiodo",
    "varPeriodoFijoSelect": "01",
    "varAnoFijoSelect": "2023",
    "columnaComparaRadio": "variacion",
    "tipoVariacionRadio": "RELATIVA",
    "lineaParametros": "MENSUAL,01/2018-05/2023,REALES",
    "lineaParametrosLabel": "MENSUAL,01/2018-05/2023,REALES",
    "lineaUnidades": "",
    "nParam": "0",
    "tipoParam": "1",
    "avanzadas": "false",
}

with requests.Session() as s:
    r = s.get(
        "https://sie.energia.gob.mx/bdiController.do?action=cuadro&cvecua=PMXC1C01E"
    )
    r = s.post(
        "https://sie.energia.gob.mx/bdiController.do",
        params=params,
        headers=headers,
        data=data,
    ).text

soup = bs(r, "lxml")
# grab table. You will need to write code to turn it into the desired output format
table = soup.select_one("#cuadroTable")
check_periods = [i.text.strip() for i in table.select(".th td")][2:]
print(check_periods)  # confirm returned dates
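To go from the grabbed table to a DataFrame, one option (not part of the original answer, and assuming #cuadroTable parses as a regular HTML table) is to hand its HTML to pandas.read_html; further cleanup of the header rows may still be needed.
# sketch: convert the scraped table element into a DataFrame (continues from the snippet above)
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]  # read_html returns a list of DataFrames
print(df.head())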
Answered By - QHarr