Tuesday, January 30, 2024

[FIXED] Using Python Selenium to extract attribute with no class, id, or other unique identifier


Issue

I need to extract text with Selenium from HTML inside a table. There are no unique classes, ids, or other identifiers I can use. The lines look like this; I need the " Cost Elements" text.

<th align="LEFT" bgcolor="7DA6CF" width="350 px" overflow="HIDDEN"> Cost Elements</th>

Here is the full block of html.

<tr>
  <th align="LEFT" bgcolor="7DA6CF" width="350 px" overflow="HIDDEN"> Cost Elements</th>
  <th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Plan</th>                                                             
  <th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Period 6</th>
  <th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Cumulative Act. 
    </th>
  <th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Commitments</th>
  <th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> $ Variance</th>
  <th align="CENTER" bgcolor="7DA6CF" width="90 px " overflow="HIDDEN"> % Remain</th>
</tr>

Here is my code, if it helps. table1_cols is where I'm trying to extract the table column names.

from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not treated as escape sequences
service = Service(executable_path=r'C:\Program Files\edgedriver_win64\msedgedriver.exe')
driver = webdriver.Edge(service=service)
driver.get('C:\\Users\\User\\Downloads\\_SAPreport-behnke r-20240102.HTM_.HTM')
ne_mesonet_table = driver.find_element(By.LINK_TEXT, "Nebraska Mesonet")
ne_mesonet_table.click()

ne_mesonet_xpath1 = '//html//body//table[1]//tbody'
table1 = driver.find_element(By.XPATH, ne_mesonet_xpath1)
table1_rows = table1.find_elements(By.TAG_NAME, "tr")  
table1_cols = table1_rows[0].find_elements(By.TAG_NAME, 'th')

Solution

You are trying to collect data from a particular column that has no unique attributes of its own.

If you are 100% sure that the table layout won't change, you can hard-code the position, e.g. //table/tbody/tr/td[5] for the 5th column (XPath positions start at 1). But in practice, you can never be 100% sure :)
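As a rough sketch of the fixed-position idea, the helper below builds the XPath for a given column number and rejects 0, since XPath positions are 1-based. The names `column_xpath` and `fixed_column_texts` and the default `table_xpath` value are illustrative assumptions, not part of Selenium or of the original question.

```python
def column_xpath(column_number, table_xpath="//table"):
    """Build an XPath that selects every <td> in one 1-based column."""
    if column_number < 1:
        # XPath positions start at 1, so td[0] would never match anything.
        raise ValueError("column_number must be 1 or greater")
    return f"{table_xpath}/tbody/tr/td[{column_number}]"


def fixed_column_texts(driver, column_number, table_xpath="//table"):
    """Return the text of every cell in one column (hypothetical helper)."""
    # Imported here so column_xpath() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By

    cells = driver.find_elements(By.XPATH, column_xpath(column_number, table_xpath))
    return [cell.text for cell in cells]
```

With a live driver, `fixed_column_texts(driver, 5)` would read the 5th column, but it silently returns the wrong data if a column is ever added or reordered.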

A more robust approach is to iterate over the header cells to find the index of the column you need. See this example, based on the-internet.herokuapp.com:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 5)

try:
    driver.get("https://the-internet.herokuapp.com/tables")

    # find a column
    table_headers = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table1']/thead//th")))
    target_column = "Email"
    for i, header in enumerate(table_headers):
        if header.text == target_column:
            column_index = i + 1
            break
    else:
        raise RuntimeError(f"Target column '{target_column}' is not found in the table.")

    # collect data
    table_rows = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table1']/tbody/tr")))
    all_emails = []
    for row in table_rows:
        column = row.find_element(By.XPATH, f"./td[{column_index}]")
        all_emails.append(column.text)

    print(f"All '{target_column}' collected: {all_emails}")
finally:
    driver.quit()

In this example, the 'Email' value from every row is collected.

If you need to collect all the data from the table, you can build a list of dictionaries, one per row, where each key is a header's text and each value is the text of ./td[header_index] in that row.
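One possible way to build that list of dictionaries is sketched below. It assumes the table uses th cells only in its header row and td cells for data. `rows_to_records` and `scrape_table` are hypothetical helper names, and the whitespace cleanup is an extra safeguard for headers that span several lines in the markup.

```python
def rows_to_records(headers, rows):
    """Pair each row's cell texts with the header names."""
    # Collapse stray spaces and newlines inside header text.
    names = [" ".join(header.split()) for header in headers]
    # zip() stops at the shorter sequence, so extra cells in a row are dropped.
    return [dict(zip(names, row)) for row in rows]


def scrape_table(driver, table_xpath):
    """Read a whole table into a list of dicts (hypothetical helper)."""
    # Imported here so rows_to_records() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By

    table = driver.find_element(By.XPATH, table_xpath)
    # Assumption: only the header row contains <th> cells.
    headers = [th.text for th in table.find_elements(By.XPATH, ".//th")]
    rows = []
    # ".//tr[td]" keeps only rows that have at least one <td> child.
    for tr in table.find_elements(By.XPATH, ".//tr[td]"):
        rows.append([td.text for td in tr.find_elements(By.XPATH, "./td")])
    return rows_to_records(headers, rows)
```

For the demo page used above, a call might look like `records = scrape_table(driver, "//table[@id='table1']")`, after which each column can be read by name, e.g. `records[0]["Email"]`.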

Answered By - sashkins

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
