Issue
I am trying to scrape this page (AAG is used only as an example):
https://bvb.ro/FinancialInstruments/Details/FinancialInstrumentsDetails.aspx?s=AAG
The main issue is that most of the page's content changes when cycling through the buttons (input type='submit' elements) under the ctl00_body_IFTC_btnlist div (visible as Overview / Trading / Charts / News / Financials / Issuer profile in the English version).
Using Selenium with a Chrome (version 98) driver, I am able to navigate through the subsections (via XPath):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
# options.add_argument('--headless')
options.add_argument('--start-maximized')
options.add_argument('--disable-gpu')
options.add_argument('--log-level=3')

# Note: a raw string should not also double the backslashes
driver = webdriver.Chrome(options=options, executable_path=r'D:\Python\workspace\vs-code\chromedriver\chromedriver.exe')
driver.get('https://bvb.ro/FinancialInstruments/Details/FinancialInstrumentsDetails.aspx?s=AAG')

# Click the second tab button via JavaScript
link = driver.find_element_by_xpath('/html/body/form/div[3]/div/div[1]/div[2]/div/div[1]/div/div/input[2]')
driver.execute_script('arguments[0].click()', link)
(Please note, I use --start-maximized not only for easier troubleshooting, but also because --headless gets blocked.)
My main issue is when I try to parse the page after having 'clicked the button'.
Namely, if I do soup = BeautifulSoup(driver.page_source, 'lxml'), I still get the initial page as served when the URL first opens (on the first subsection, Overview). This is consistent with manual navigation through those 6 subsections in a Chrome browser: the URL never changes, and if I do Right Click -> View page source I always see the initial version.
Now, if I (manually) do Right Click -> Inspect on an element of interest, I do find what I am looking for.
I am not sure how best to do this programmatically: navigate through the page using Selenium while also being able to parse the updated content with BeautifulSoup.
Edit: Answered.
Solution
Turns out the driver object holds the exact information I need.
So, what I do is:
driver.find_element_by_id('ID_OF_ELEMENT').get_attribute('innerHTML')
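Putting the pieces together, here is a minimal sketch of the whole flow. It is written against the Selenium 4 locator API (find_element(By.ID, ...) instead of the older find_element_by_*), CONTENT_DIV_ID is a placeholder for whatever id you find via Inspect, and the parse_fragment helper is an addition of mine, not part of the original answer:

```python
# Sketch: click a tab, wait for the partial postback, read the updated
# container's innerHTML, and hand it to BeautifulSoup.
from bs4 import BeautifulSoup


def parse_fragment(inner_html):
    # 'html.parser' is the stdlib-backed parser; the post uses 'lxml',
    # which behaves the same here if lxml is installed.
    return BeautifulSoup(inner_html, 'html.parser')


def scrape_tab():
    # Requires a local Chrome + chromedriver. Imports are kept inside the
    # function so parse_fragment() is usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    try:
        driver.get('https://bvb.ro/FinancialInstruments/Details/'
                   'FinancialInstrumentsDetails.aspx?s=AAG')
        # Click the second tab via JavaScript, as in the question.
        tab = driver.find_element(
            By.XPATH,
            '/html/body/form/div[3]/div/div[1]/div[2]/div/div[1]/div/div/input[2]')
        driver.execute_script('arguments[0].click()', tab)
        # Waiting for presence is a simplification: on a partial postback the
        # container may exist before the click, so you may instead need to
        # wait for the old content to go stale or for specific text to appear.
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'CONTENT_DIV_ID')))
        inner = driver.find_element(
            By.ID, 'CONTENT_DIV_ID').get_attribute('innerHTML')
        return parse_fragment(inner)
    finally:
        driver.quit()
```

scrape_tab() then returns a BeautifulSoup object built from the updated subsection only, which you can query as usual (soup.find('table'), soup.get_text(), and so on).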
Answered By - KwisatzHaderachDev