Issue
I am trying to scrape information from https://www.kw.com/agent/search/ca. I can move to the desired elements using ActionChains. The problem is scrolling further down the page: the script gets stuck after completing the first 50 elements. How should I modify my script to load more elements? A workaround would be very helpful. Thanks
import time
import requests
from bs4 import BeautifulSoup
import json
from selenium import webdriver
from selenium.webdriver import ActionChains, Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import re

driver = webdriver.Chrome()
driver.get("https://www.kw.com/agent/search/ca/")
driver.maximize_window()

# Locate the cookie-acceptance button
element = WebDriverWait(driver, 15).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#__next > div.app.app--consumer > div.KWBanner.KWBanner--cookieAcceptance > div > button")))
# Click the element
ActionChains(driver).move_to_element(driver.find_element(
    By.CSS_SELECTOR, "#__next > div.app.app--consumer > div.KWBanner.KWBanner--cookieAcceptance > div > button")).click().perform()

html_content = '<div class="row FindAgentRoute__totalCount"><div class="col-6 col-md-4 col-l-4 col-xl-4">Showing 1,188 Agents</div></div>'
# Use a regular expression to extract the number
match = re.search(r'Showing ([0-9,]+) Agents', html_content)
result = 0
if match:
    # Get the matched number, remove commas, then convert to int
    number_str = match.group(1).replace(',', '')
    result = int(number_str)
    print(result)
else:
    print("Number not found in the string.")

def scroll_page():
    actions = ActionChains(driver)
    actions.send_keys(Keys.PAGE_DOWN).perform()

def getdata(response):
    htm = response
    soup = BeautifulSoup(htm, 'lxml')
    json_data = json.loads(soup.find('script', {'id': '__NEXT_DATA__'}).get_text())
    name = json_data['props']['pageProps']['agentData']['name']['full']
    city = json_data['props']['pageProps']['agentData']['location']['city']
    state = json_data['props']['pageProps']['agentData']['location']['state']
    email = json_data['props']['pageProps']['agentData']['email']
    website = json_data['props']['pageProps']['agentData']['website']
    print(f"{name}, {city}, {state}, {email}, {website}")
    pd.DataFrame([[name, city, state, email, website]],
                 columns=['Name', 'City', 'State', 'Email', 'Website']).to_csv('kw.csv', mode='a', header=False)

# wait until the cookie-acceptance button with id "onetrust-accept-btn-handler" is clicked
# element = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, "KWButton KWButton--primary KWButton--red")))
# ActionChains(driver).move_to_element(driver.find_element(By.CLASS_NAME, "KWButton KWButton--primary KWButton--red")).click().perform()

element = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, "AgentCard")))
elements = driver.find_elements(By.CLASS_NAME, "AgentCard")
print(len(elements))

try:
    for counter in range(40, result + 1):
        try:
            WebDriverWait(driver, 100)
            xpath = f'//*[@id="kw-skip-nav"]/div/div[2]/div/div/div/div[2]/div/div[{counter}]/div'
            print(xpath)
            elem = WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, xpath)))
            ActionChains(driver).move_to_element(driver.find_element(By.XPATH, xpath)).click().perform()
            print(driver.current_url)
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.CLASS_NAME, "AgentContent__section")))
            getdata(requests.get(driver.current_url).text)
            driver.back()
            if counter % 50 == 0:
                scroll_page()
            time.sleep(4)
        except:
            print("no luck")
except Exception as e:
    print(e)
    print("broken link")
finally:
    driver.quit()
Solution
Any scroll should be finite :)
In your case, there is a label which shows how many agents there are. You can utilize that number to figure out how many records you want to collect.
You can use the following approach:
- Get the target number of iterations.
- Start collecting records from the starting row, 0 (the first agent), up to the current maximum number of displayed rows.
- After collecting the last one, use ActionChains to move to that last element (maybe add an additional ~10px scroll if needed; see the sketch after this list). This triggers the loading of the next 'portion' of agents.
- Change the starting row number to the number of the row you've moved to.
- Start the next loop iteration from that n-th row up to the current maximum number of displayed rows (it should be the previous number plus the portion of newly loaded ones).
You can also add an iteration-control mechanism so you don't end up in an infinite loop (which could be caused by UI changes or lags).
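For the "additional 10px scroll" step, here is a minimal sketch; the nudge_lazy_load helper name is mine, and scroll_by_amount requires Selenium 4.2+:

from selenium.webdriver import ActionChains

# Hypothetical helper (not part of the code below): move to the last visible
# card, then scroll slightly further so the lazy loader fires.
def nudge_lazy_load(driver, last_card):
    actions = ActionChains(driver)
    actions.move_to_element(last_card).perform()  # bring the last card into view
    actions.scroll_by_amount(0, 10).perform()     # optional extra 10px scroll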
I don't know your exact business case (I see you are also using bs4), but the following code works OK for me (1190 names were collected):
import re

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 5)
actions = ActionChains(driver)

try:
    driver.get("https://www.kw.com/agent/search/ca")

    # get the total count of agents from the "Showing N Agents" label
    total_count = 0
    total_count_label = wait.until(
        EC.visibility_of_element_located((By.XPATH, '//div[contains(@class, "FindAgentRoute__totalCount")]'))).text
    match = re.search(r'\b(\d{1,3}(,\d{3})*|\d+)\b', total_count_label)
    if match:
        total_count = int(match.group().replace(',', ''))

    iteration = 0
    tail_agent_id = 0
    collected_agents = []
    while len(collected_agents) != total_count:
        if iteration > total_count / 2:
            raise RuntimeError("Too many iterations")
        # process the rows currently rendered, starting after the last row
        # handled in the previous iteration
        for i in range(tail_agent_id, len(driver.find_elements(By.XPATH, '//div[@class="AgentCard"]'))):
            el = wait.until(
                EC.visibility_of_element_located((By.XPATH, f'(//div[@class="AgentCard"])[{i + 1}]'))
            )
            # your parsing logic goes here
            name = el.find_element(By.XPATH, './/div[contains(@class, "AgentCard__name")]').text
            collected_agents.append(name)
            print("Collected: " + name)
        tail_agent_id = len(collected_agents)
        # moving to the last collected card triggers loading of the next portion
        actions.move_to_element(
            driver.find_element(By.XPATH, f'(//div[@class="AgentCard"])[{tail_agent_id}]')
        ).perform()
        iteration += 1
finally:
    driver.quit()
You can adjust it for your needs, e.g. keep only the move-to-the-last-element action in the loop and do the parsing with bs4.
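For instance, here is a minimal sketch of that variant, assuming all loaded cards stay in the DOM and reusing the AgentCard / AgentCard__name classes from the code above (run it before driver.quit(), after the scrolling loop has loaded every card):

from bs4 import BeautifulSoup

# Sketch, not a drop-in solution: once scrolling has loaded all agent cards,
# hand the full page source to bs4 and parse everything in one pass.
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div.AgentCard'):
    name_div = card.select_one('div[class*="AgentCard__name"]')
    if name_div:
        print(name_div.get_text(strip=True))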
Answered By - sashkins