Issue
So I'm using Selenium to press the "Load more" button, and everything loads properly. But when I then try to get the info of all the loaded products, I only get the info of the first 36 items, the ones that appear before the first "Load more" button.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import json
import time
import requests
allinfo=[]
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])
url="https://zadaa.co/de-en/products/women/clothes-dresses/"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),chrome_options=chrome_options)
driver.get(url)
r=requests.get(url)
soup=BeautifulSoup(r.content,"html.parser")
wait = WebDriverWait(driver, 10)
closebutton=wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div[5]/button')))
closebutton.click()
for x in range(9):
    button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-products")))
    button.click()
content=soup.find_all('a',class_='product-list-item')
for properties in content:
    brand=properties.find("p",class_='product-list-item-title').text
    info={
        'name':brand,
    }
    allinfo.append(info)
df=pd.DataFrame(allinfo)
print(df.head())
df.to_csv('zadaa.csv')
This is the web page I'm trying to scrape- https://zadaa.co/de-en/products/women/clothes-dresses/
Sorry for some weird English usage.
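(Editor's note: a likely root cause, worth spelling out. In the script above, soup is built from a separate requests.get(url) response fetched before any clicks, so Selenium's "Load more" clicks never reach it; re-parsing driver.page_source after the click loop would expose all loaded items. The parsing step can be sketched without a live browser; the HTML string below is only a stand-in for what driver.page_source would return.)

```python
from bs4 import BeautifulSoup

# Stand-in for `driver.page_source` taken AFTER the "Load more" clicks;
# in the real script you would write: page_source = driver.page_source
page_source = """
<a class="product-list-item" href="/p/1"><p class="product-list-item-title">BRAND A</p></a>
<a class="product-list-item" href="/p/2"><p class="product-list-item-title">BRAND B</p></a>
"""

# Parse the page Selenium actually sees, not a fresh requests.get() response
soup = BeautifulSoup(page_source, "html.parser")
names = [a.find("p", class_="product-list-item-title").text
         for a in soup.find_all("a", class_="product-list-item")]
print(names)  # ['BRAND A', 'BRAND B']
```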
Solution
You can simulate the Ajax calls with the requests module to get the data directly, without Selenium (beware: there are 12k+ products):
import requests
from bs4 import BeautifulSoup
url = "https://zadaa.co/de-en/products/women/clothes-dresses/"
api_url = "https://zadaa.co/wp-admin/admin-ajax.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
payload = {
    "action": "get_more_products",
    "lang": "de-en",
    "security": "05ef973f4c",
    "query_id": soup.select_one("[data-query-id]")["data-query-id"],
    "offset": 0,
}

while True:
    data = requests.post(api_url, data=payload).json()
    if not data["success"]:
        break
    soup = BeautifulSoup(data["data"], "html.parser")
    for i in soup.select(".product-list-item"):
        print(i.select_one(".product-list-item-title").text)
        print(i["href"])
        print("-" * 80)
    payload["offset"] += 36
Prints:
...
CITY GIRL PARIS
https://zadaa.co/de-en/products/women/clothes-dresses/city-girl-paris/3735824/
--------------------------------------------------------------------------------
ZAFUL
https://zadaa.co/de-en/products/women/clothes-dresses/zaful/3735781/
--------------------------------------------------------------------------------
NKD
https://zadaa.co/de-en/products/women/clothes-dresses/nkd/3735768/
--------------------------------------------------------------------------------
GREAT RUMORS
https://zadaa.co/de-en/products/women/clothes-dresses/great-rumors/3735762/
--------------------------------------------------------------------------------
...and so on.
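Since the original script collected the results into a pandas DataFrame and a CSV, the loop above can feed the same pipeline: each Ajax response's data["data"] fragment is parsed and its fields appended to a list of dicts. A minimal sketch of that collection step, with the Ajax response stubbed by a small HTML fragment (the real loop would parse data["data"] from each POST the same way):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for one `data["data"]` HTML fragment from the Ajax endpoint
ajax_html = """
<a class="product-list-item" href="https://zadaa.co/p/1"><p class="product-list-item-title">CITY GIRL PARIS</p></a>
<a class="product-list-item" href="https://zadaa.co/p/2"><p class="product-list-item-title">ZAFUL</p></a>
"""

allinfo = []
soup = BeautifulSoup(ajax_html, "html.parser")
for item in soup.select(".product-list-item"):
    # Collect each product's name and link into a row dict
    allinfo.append({
        "name": item.select_one(".product-list-item-title").text,
        "url": item["href"],
    })

df = pd.DataFrame(allinfo)
df.to_csv("zadaa.csv", index=False)
print(df)
```

In the full script, the append loop would simply live inside the while True pagination loop, with df built once after the loop ends.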
Answered By - Andrej Kesely