Issue
I need to extract text with Selenium from HTML inside a table. There are no unique classes, IDs, or other identifiers I can use. The lines look like this; I need the " Cost Elements" text.
<th align="LEFT" bgcolor="7DA6CF" width="350 px" overflow="HIDDEN"> Cost Elements</th>
Here is the full block of HTML.
<tr>
<th align="LEFT" bgcolor="7DA6CF" width="350 px" overflow="HIDDEN"> Cost Elements</th>
<th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Plan</th>
<th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Period 6</th>
<th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Cumulative Act.
</th>
<th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> Commitments</th>
<th align="CENTER" bgcolor="7DA6CF" width="160 px" overflow="HIDDEN"> $ Variance</th>
<th align="CENTER" bgcolor="7DA6CF" width="90 px " overflow="HIDDEN"> % Remain</th>
</tr>
Here is my code, if it helps. I'm trying to extract the table column names into table1_cols.
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By
# Raw string so the backslashes in the Windows path are not read as escape sequences
service = Service(executable_path=r'C:\Program Files\edgedriver_win64\msedgedriver.exe')
driver = webdriver.Edge(service=service)
driver.get('C:\\Users\\User\\Downloads\\_SAPreport-behnke r-20240102.HTM_.HTM')
ne_mesonet_table = driver.find_element(By.LINK_TEXT, "Nebraska Mesonet")
ne_mesonet_table.click()
ne_mesonet_xpath1 = '//html//body//table[1]//tbody'
table1 = driver.find_element(By.XPATH, ne_mesonet_xpath1)
table1_rows = table1.find_elements(By.TAG_NAME, "tr")
table1_cols = table1_rows[0].find_elements(By.TAG_NAME, 'th')
Solution
You are trying to collect data from a specific column, and that column has no unique attributes to select it by.
If you are certain that the table layout won't change, you can use a fixed index, e.g. //table/tbody/tr/td[5] for the 5th column. In practice, though, you can never be 100% sure :)
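As a minimal sketch of the fixed-index variant: the helper names `fixed_column_xpath` and `read_fixed_column` are hypothetical, and the code assumes the data cells are `td` elements under `tbody`. Keep in mind that XPath positions are 1-based.

```python
def fixed_column_xpath(table_xpath, column_number):
    """Build an XPath that selects one column's cells by position.

    XPath positions are 1-based, so column_number=5 means the 5th column.
    Assumes the data cells are <td> elements inside <tbody>.
    """
    if column_number < 1:
        raise ValueError("XPath positions start at 1")
    return f"{table_xpath}/tbody/tr/td[{column_number}]"


def read_fixed_column(driver, table_xpath, column_number):
    """Return the text of every cell in the given column.

    `driver` is an already created Selenium WebDriver.
    """
    # Imported here so the XPath helper above also works without Selenium installed.
    from selenium.webdriver.common.by import By

    xpath = fixed_column_xpath(table_xpath, column_number)
    return [cell.text for cell in driver.find_elements(By.XPATH, xpath)]
```

This breaks silently if a column is added or reordered, because it will keep returning whatever happens to sit at that position.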
A more robust approach is to iterate over the header cells to find the index of the required column. See this example based on the-internet.herokuapp.com:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 5)
try:
    driver.get("https://the-internet.herokuapp.com/tables")

    # find a column
    table_headers = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table1']/thead//th")))
    target_column = "Email"
    for i, header in enumerate(table_headers):
        if header.text == target_column:
            column_index = i + 1  # XPath positions are 1-based
            break
    else:
        # the loop finished without a break, so no header matched
        raise RuntimeError(f"Target column '{target_column}' is not found in the table.")

    # collect data
    table_rows = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='table1']/tbody/tr")))
    all_emails = []
    for row in table_rows:
        column = row.find_element(By.XPATH, f"./td[{column_index}]")
        all_emails.append(column.text)
    print(f"All '{target_column}' collected: {all_emails}")
finally:
    driver.quit()
In this example, the 'Email' value from every row is collected.
If you need to collect all the data from the table, you can create a list of dictionaries, one per row (key = header text, value = text of ./td[header_index]).
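The list-of-dictionaries idea could look roughly like the sketch below. The function names `rows_to_dicts` and `table_to_dicts` are hypothetical. It assumes the table has a single header row of `th` cells, that data rows are `tr` elements with `td` cells under `tbody`, and that there are no nested tables.

```python
def rows_to_dicts(header_texts, rows_texts):
    """Pair each row's cell texts with the header texts, one dict per row.

    Whitespace around headers and cells is stripped. If a row has fewer
    cells than there are headers, zip() drops the unmatched headers.
    """
    headers = [h.strip() for h in header_texts]
    return [dict(zip(headers, (cell.strip() for cell in row))) for row in rows_texts]


def table_to_dicts(driver, table_xpath):
    """Read a whole table through Selenium into a list of dictionaries.

    `driver` is an already created Selenium WebDriver.
    """
    # Imported here so rows_to_dicts can be used and tested without Selenium.
    from selenium.webdriver.common.by import By

    header_cells = driver.find_elements(By.XPATH, f"{table_xpath}//th")
    # tr[td] keeps only rows that contain data cells, which skips a header
    # row that sits inside <tbody>, as in the table from the question.
    rows = driver.find_elements(By.XPATH, f"{table_xpath}/tbody/tr[td]")
    rows_texts = [
        [cell.text for cell in row.find_elements(By.XPATH, "./td")]
        for row in rows
    ]
    return rows_to_dicts([th.text for th in header_cells], rows_texts)
```

With the example page above, a call such as `table_to_dicts(driver, "//table[@id='table1']")` should give one dictionary per row, keyed by the header text.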
Answered By - sashkins