Issue
So I'm using Selenium to press the "Load more" button, and everything loads properly. But when I then try to get the info of all the loaded products, I only get the info of the first 36 items, the ones that appear before the first "Load more" button.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import json
import time
import requests
allinfo=[]
chrome_options = Options()
chrome_options.add_experimental_option("detach", True)
chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])
url="https://zadaa.co/de-en/products/women/clothes-dresses/"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()),chrome_options=chrome_options)
driver.get(url)
r=requests.get(url)
soup=BeautifulSoup(r.content,"html.parser")
wait = WebDriverWait(driver, 10)
closebutton=wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="content"]/div[5]/button')))
closebutton.click()
for x in range(9):
    button = wait.until(EC.element_to_be_clickable((By.ID, "load-more-products")))
    button.click()
content=soup.find_all('a',class_='product-list-item')
for properties in content:
    brand=properties.find("p",class_='product-list-item-title').text
    info={
        'name':brand,
    }
    allinfo.append(info)
df=pd.DataFrame(allinfo)
print(df.head())
df.to_csv('zadaa.csv')
This is the web page I'm trying to scrape- https://zadaa.co/de-en/products/women/clothes-dresses/
Sorry for some weird English usage.
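(Editor's note: a likely root cause, worth spelling out. In the script above, soup is built from a separate requests.get(url) response fetched before any clicks, so Selenium's "Load more" clicks never reach it; re-parsing driver.page_source after the click loop would expose all loaded items. The parsing step can be sketched without a live browser; the HTML string below is only a stand-in for what driver.page_source would return.)

```python
from bs4 import BeautifulSoup

# Stand-in for `driver.page_source` taken AFTER the "Load more" clicks;
# in the real script you would write: page_source = driver.page_source
page_source = """
<a class="product-list-item" href="/p/1"><p class="product-list-item-title">BRAND A</p></a>
<a class="product-list-item" href="/p/2"><p class="product-list-item-title">BRAND B</p></a>
"""

# Parse the page Selenium actually sees, not a fresh requests.get() response
soup = BeautifulSoup(page_source, "html.parser")
names = [a.find("p", class_="product-list-item-title").text
         for a in soup.find_all("a", class_="product-list-item")]
print(names)  # ['BRAND A', 'BRAND B']
```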
Solution
You can simulate the Ajax calls with the requests module to get the data directly, without Selenium (beware: there are 12k+ products):
import requests
from bs4 import BeautifulSoup
url = "https://zadaa.co/de-en/products/women/clothes-dresses/"
api_url = "https://zadaa.co/wp-admin/admin-ajax.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
payload = {
    "action": "get_more_products",
    "lang": "de-en",
    "security": "05ef973f4c",
    "query_id": soup.select_one("[data-query-id]")["data-query-id"],
    "offset": 0,
}

while True:
    data = requests.post(api_url, data=payload).json()
    if not data["success"]:
        break
    soup = BeautifulSoup(data["data"], "html.parser")
    for i in soup.select(".product-list-item"):
        print(i.select_one(".product-list-item-title").text)
        print(i["href"])
        print("-" * 80)
    payload["offset"] += 36
Prints:
...
CITY GIRL PARIS
https://zadaa.co/de-en/products/women/clothes-dresses/city-girl-paris/3735824/
--------------------------------------------------------------------------------
ZAFUL
https://zadaa.co/de-en/products/women/clothes-dresses/zaful/3735781/
--------------------------------------------------------------------------------
NKD
https://zadaa.co/de-en/products/women/clothes-dresses/nkd/3735768/
--------------------------------------------------------------------------------
GREAT RUMORS
https://zadaa.co/de-en/products/women/clothes-dresses/great-rumors/3735762/
--------------------------------------------------------------------------------
...and so on.
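Since the original script collected the results into a pandas DataFrame and a CSV, the loop above can feed the same pipeline: each Ajax response's data["data"] fragment is parsed and its fields appended to a list of dicts. A minimal sketch of that collection step, with the Ajax response stubbed by a small HTML fragment (the real loop would parse data["data"] from each POST the same way):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for one `data["data"]` HTML fragment from the Ajax endpoint
ajax_html = """
<a class="product-list-item" href="https://zadaa.co/p/1"><p class="product-list-item-title">CITY GIRL PARIS</p></a>
<a class="product-list-item" href="https://zadaa.co/p/2"><p class="product-list-item-title">ZAFUL</p></a>
"""

allinfo = []
soup = BeautifulSoup(ajax_html, "html.parser")
for item in soup.select(".product-list-item"):
    # Collect each product's name and link into a row dict
    allinfo.append({
        "name": item.select_one(".product-list-item-title").text,
        "url": item["href"],
    })

df = pd.DataFrame(allinfo)
df.to_csv("zadaa.csv", index=False)
print(df)
```

In the full script, the append loop would simply live inside the while True pagination loop, with df built once after the loop ends.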
Answered By - Andrej Kesely