Thursday, April 14, 2022

[FIXED] How to distinguish two tables with the same relative XPATH with Selenium in Python

Labels: imdb, python, selenium, web-scraping

Issue

I'm trying to scrape some data from IMDb with Selenium in Python, but I have a problem. For each movie I have to fetch the directors and writers. They are listed in two separate tables that share the same @class. I need to tell the two tables apart when scraping; otherwise the program may sometimes record a writer as a director, or vice versa.

I tried using a relative XPath to find all elements (tables) that match it, then looping over them and telling them apart by the table title (an h4 element) via the preceding-sibling axis. The code runs without errors, but it never finds anything (it returns nan every time).

This is my code:

    counter = 1
    try:
        driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
        ssleep()
        tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
        counter = 1
        for table in tables:
            xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]' 
            xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
            table_title = driver.find_element(By.XPATH, xpath_h4).text
            if table_title == "Directed by":
                rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
                for row in rows_director:
                    director = row.find_elements(By.CSS_SELECTOR, 'a')
                    director = [x.text for x in director]
                    if len(director) == 1:
                        director = ''.join(map(str, director))
                    else:
                        director = ', '.join(map(str, director))
                        director_list.append(director)
        counter += 1

    except NoSuchElementException:
        # director = np.nan
        director_list.append(np.nan)

Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.

(here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)

Solution

To extract the names of the directors and writers of a movie from its imdb.com full-credits page, induce WebDriverWait for visibility_of_all_elements_located() and use either of the following locator strategies:

Using CSS_SELECTOR:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director + table > tbody tr > td > a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer + table > tbody tr > td > a")))])

Using XPATH:

    driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']/following::table[1]/tbody//tr/td/a")))])
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']/following::table[1]/tbody//tr/td/a")))])

Console Output:

    ['Matt Reeves']
    ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
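The question stores one value per movie and falls back to nan when nothing is found. A minimal sketch of that last step, turning a list of scraped names like the ones printed above into a single cell, could look like this. The helper name names_to_cell is hypothetical, and float("nan") is used as a dependency-free stand-in for the question's np.nan.

```python
def names_to_cell(names):
    """Join scraped names into one string, or return NaN when none were found."""
    # Drop empty or whitespace-only entries (e.g. links without visible text).
    cleaned = [n.strip() for n in names if n and n.strip()]
    # ', '.join covers both the one-name and the many-names case,
    # so no separate branch on len(names) is needed.
    return ", ".join(cleaned) if cleaned else float("nan")


director_list = []
director_list.append(names_to_cell(["Matt Reeves"]))  # one director
director_list.append(names_to_cell([]))               # nothing scraped
```

Appending unconditionally, rather than only inside one branch, keeps director_list aligned with the list of movies.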

Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
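As for why the loop in the question finds nothing: judging from the selectors above, each h4 heading is a sibling of the table element, not of its tbody, so tbody[...]/preceding-sibling::h4 has nothing to match. In addition, Selenium locators must return elements, so an XPath ending in /text() cannot be used with find_element. A minimal sketch of the loop-based approach with those two points addressed might look like the following. The function name credits_by_heading is hypothetical, the class value is copied from the question, and the plain string "xpath" is used in place of By.XPATH (it is the same value) so the snippet can be imported without a browser session. It has not been run against the live page.

```python
XPATH = "xpath"  # plain-string form of selenium's By.XPATH

# Assumption: class value copied unchanged from the question's locator.
TABLES_XPATH = '//table[@class="simpleTable simpleCreditsTable"]'


def credits_by_heading(driver):
    """Map each credits-table heading (e.g. 'Directed by') to its list of names."""
    credits = {}
    for table in driver.find_elements(XPATH, TABLES_XPATH):
        # A relative XPath (leading '.') starts at *this* table. The h4 is
        # looked up as a sibling of <table>, and its text is read through
        # .text instead of an XPath text() step.
        headings = table.find_elements(XPATH, "./preceding-sibling::h4[1]")
        if not headings:
            continue  # table without a heading: skip rather than guess
        # Collapse stray whitespace so comparisons on the title are reliable.
        title = " ".join(headings[0].text.split())
        names = [a.text.strip() for a in table.find_elements(XPATH, ".//tr//a")]
        credits[title] = [n for n in names if n]
    return credits
```

With a live driver, credits_by_heading(driver).get("Directed by", []) would then give the director names, assuming the heading text matches the one used in the question.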

Answered By - undetected Selenium

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
