Issue
I am trying to scrape information from https://www.kw.com/agent/search/ca. I can move to the desired elements using ActionChains. The problem is scrolling further down the page: the script gets stuck after completing the first 50 elements. How should I modify my script to load more elements? A workaround would be very helpful. Thanks
import time
import requests
from bs4 import BeautifulSoup
import json
from selenium import webdriver
from selenium.webdriver import ActionChains, Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import re

driver = webdriver.Chrome()
driver.get("https://www.kw.com/agent/search/ca/")
driver.maximize_window()

# Locate the cookie-acceptance button
element = WebDriverWait(driver, 15).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#__next > div.app.app--consumer > div.KWBanner.KWBanner--cookieAcceptance > div > button")))
# Click the element
ActionChains(driver).move_to_element(driver.find_element(
    By.CSS_SELECTOR, "#__next > div.app.app--consumer > div.KWBanner.KWBanner--cookieAcceptance > div > button")).click().perform()

html_content = '<div class="row FindAgentRoute__totalCount"><div class="col-6 col-md-4 col-l-4 col-xl-4">Showing 1,188 Agents</div></div>'
# Use a regular expression to extract the number
match = re.search(r'Showing ([0-9,]+) Agents', html_content)
result = 0
if match:
    # Get the matched number, remove commas, then convert to int
    number_str = match.group(1).replace(',', '')
    result = int(number_str)
    print(result)
else:
    print("Number not found in the string.")

def scroll_page():
    actions = ActionChains(driver)
    actions.send_keys(Keys.PAGE_DOWN).perform()

def getdata(response):
    htm = response
    soup = BeautifulSoup(htm, 'lxml')
    json_data = json.loads(soup.find('script', {'id': '__NEXT_DATA__'}).get_text())
    name = json_data['props']['pageProps']['agentData']['name']['full']
    city = json_data['props']['pageProps']['agentData']['location']['city']
    state = json_data['props']['pageProps']['agentData']['location']['state']
    email = json_data['props']['pageProps']['agentData']['email']
    website = json_data['props']['pageProps']['agentData']['website']
    print(f"{name}, {city}, {state}, {email}, {website}")
    pd.DataFrame([[name, city, state, email, website]],
                 columns=['Name', 'City', 'State', 'Email', 'Website']).to_csv('kw.csv', mode='a', header=False)

# wait until the cookie-acceptance button with id "onetrust-accept-btn-handler" is clicked
# element = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, "KWButton KWButton--primary KWButton--red")))
# ActionChains(driver).move_to_element(driver.find_element(By.CLASS_NAME, "KWButton KWButton--primary KWButton--red")).click().perform()

element = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, "AgentCard")))
elements = driver.find_elements(By.CLASS_NAME, "AgentCard")
print(len(elements))

try:
    for counter in range(40, result + 1):
        try:
            WebDriverWait(driver, 100)
            xpath = f'//*[@id="kw-skip-nav"]/div/div[2]/div/div/div/div[2]/div/div[{counter}]/div'
            print(xpath)
            elem = WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, xpath)))
            ActionChains(driver).move_to_element(driver.find_element(By.XPATH, xpath)).click().perform()
            print(driver.current_url)
            WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.CLASS_NAME, "AgentContent__section")))
            getdata(requests.get(driver.current_url).text)
            driver.back()
            if counter % 50 == 0:
                scroll_page()
            time.sleep(4)
        except:
            print("no luck")
except Exception as e:
    print(e)
    print("broken link")
finally:
    driver.quit()
Solution
Any scroll should be finite :)
In your case, there is a label which shows how many agents there are. You can utilize that number to figure out how many records you want to collect.
You can use the following approach:
- Get the target number of iterations.
- Start collecting records from the starting row, 0 (the first agent), up to the current maximum number of displayed rows.
- After collecting the last one, use ActionChains to move to that last element (maybe add an additional ~10px scroll if needed; see the sketch after this list). This triggers the loading of the next 'portion' of agents.
- Change the starting row number to the number of the row you've moved to.
- Start the next loop iteration from that n-th row up to the current maximum number of displayed rows (it should be the previous number plus the portion of newly loaded ones).
You can also add an iteration-control mechanism so you don't end up in an infinite loop (which could be caused by UI changes or lags).
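For the "additional 10px scroll" step, here is a minimal sketch; the nudge_lazy_load helper name is mine, and scroll_by_amount requires Selenium 4.2+:

from selenium.webdriver import ActionChains

# Hypothetical helper (not part of the code below): move to the last visible
# card, then scroll slightly further so the lazy loader fires.
def nudge_lazy_load(driver, last_card):
    actions = ActionChains(driver)
    actions.move_to_element(last_card).perform()  # bring the last card into view
    actions.scroll_by_amount(0, 10).perform()     # optional extra 10px scroll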
I don't know your exact business case (I see you are also using bs4), but the following code works OK for me (1190 names were collected):
import re

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 5)
actions = ActionChains(driver)

try:
    driver.get("https://www.kw.com/agent/search/ca")

    # get the total count of agents from the "Showing N Agents" label
    total_count = 0
    total_count_label = wait.until(
        EC.visibility_of_element_located((By.XPATH, '//div[contains(@class, "FindAgentRoute__totalCount")]'))).text
    match = re.search(r'\b(\d{1,3}(,\d{3})*|\d+)\b', total_count_label)
    if match:
        total_count = int(match.group().replace(',', ''))

    iteration = 0
    tail_agent_id = 0
    collected_agents = []
    while len(collected_agents) != total_count:
        if iteration > total_count / 2:
            raise RuntimeError("Too many iterations")
        # process the rows currently rendered, starting after the last row
        # handled in the previous iteration
        for i in range(tail_agent_id, len(driver.find_elements(By.XPATH, '//div[@class="AgentCard"]'))):
            el = wait.until(
                EC.visibility_of_element_located((By.XPATH, f'(//div[@class="AgentCard"])[{i + 1}]'))
            )
            # your parsing logic goes here
            name = el.find_element(By.XPATH, './/div[contains(@class, "AgentCard__name")]').text
            collected_agents.append(name)
            print("Collected: " + name)
        tail_agent_id = len(collected_agents)
        # moving to the last collected card triggers loading of the next portion
        actions.move_to_element(
            driver.find_element(By.XPATH, f'(//div[@class="AgentCard"])[{tail_agent_id}]')
        ).perform()
        iteration += 1
finally:
    driver.quit()
You can adjust it for your needs, e.g. keep only the move-to-the-last-element action in the loop and do the parsing with bs4.
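For instance, here is a minimal sketch of that variant, assuming all loaded cards stay in the DOM and reusing the AgentCard / AgentCard__name classes from the code above (run it before driver.quit(), after the scrolling loop has loaded every card):

from bs4 import BeautifulSoup

# Sketch, not a drop-in solution: once scrolling has loaded all agent cards,
# hand the full page source to bs4 and parse everything in one pass.
soup = BeautifulSoup(driver.page_source, 'lxml')
for card in soup.select('div.AgentCard'):
    name_div = card.select_one('div[class*="AgentCard__name"]')
    if name_div:
        print(name_div.get_text(strip=True))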
Answered By - sashkins