Issue
Please check this link: https://maroof.sa/businesses.
It is a website from which I want to extract links.
For example, if you scroll down you will find a card for a store named "Marwa store"; clicking this card redirects you to the store page.
I need to scrape the links for all stores on the page https://maroof.sa/businesses.
After inspecting the page, I found that the link is hidden.
I have successfully extracted the store names, but I can't find the links.
Thanks in advance.
import time
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.by import By
from selenium import webdriver
from scrapy import Selector
import csv
driver = webdriver.Chrome()
driver.get(url="https://maroof.sa/businesses")
html = driver.page_source
names = driver.find_elements(By.CSS_SELECTOR , 'div.storeCard')
Solution
It's impossible to get the business link from the card itself; however, it can be built from the data returned by the request whose URL contains business/search.
The business link follows the pattern {url}/details/{id}, where id comes from the items array of the response JSON object.
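As a rough illustration of the pattern above, here is a minimal sketch that turns an already-parsed response body into detail links. The helper name build_links and the sample ids are made up for the example; the only things taken from the answer are the items list, the id field, and the {url}/details/{id} pattern.

```python
def build_links(base_url, response_body):
    """Return one detail link per entry in the 'items' list of a parsed body."""
    base = base_url.rstrip("/")
    # Missing 'items' is treated as an empty result instead of raising.
    return [f"{base}/details/{item['id']}" for item in response_body.get("items", [])]


# Hypothetical sample body; real ids come from the business/search response.
sample = {"items": [{"id": 101}, {"id": 202}]}
links = build_links("https://maroof.sa/businesses", sample)
for link in links:
    print(link)
```

Keeping the link building separate from the browser code makes it easy to check against a saved response without starting Chrome.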
You can capture the needed response using the Chrome DevTools Protocol, which is now available in Selenium.
The site also has an anti-scraping mechanism and doesn't load every time for me, so you may need to use a proxy, undetected Selenium, etc. I added some stealth Chrome options, but they don't always avoid the bot detection (the site thinks I'm a bot even in a regular browser, so I think their bot detection is broken).
import json
import time

from selenium import webdriver

options = webdriver.ChromeOptions()
# Record DevTools network events in the "performance" log.
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})


def enable_stealth():
    # Options meant to make automated Chrome look more like a regular browser.
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-dev-shm-usage')
    options.add_experimental_option("useAutomationExtension", False)
    options.add_argument("--enable-javascript")
    options.add_argument("--enable-cookies")
    options.add_argument('--disable-web-security')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])


enable_stealth()
driver = webdriver.Chrome(options=options)
url = "https://maroof.sa/businesses"
driver.get(url)
# Give the page time to call the search API before reading the log.
time.sleep(5)
logs = driver.get_log("performance")
target_url = 'business/search'


def get_links():
    for log in logs:
        message = log["message"]
        if "Network.responseReceived" not in message:
            continue
        params = json.loads(message)["message"].get("params")
        if params is None:
            continue
        response = params.get("response")
        if response is None or target_url not in response["url"]:
            continue
        # Fetch the response body over CDP and build one link per item.
        body = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': params["requestId"]})
        items = json.loads(body['body'])['items']
        for item in items:
            link = f"{url}/details/{item['id']}"
            print(link)


get_links()
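The fixed five-second sleep is a guess, and since the site does not load reliably, polling the log may work better. Below is a minimal sketch of that idea, not part of the original answer: collect_logs_until is a hypothetical helper, and the timeout and interval values are arbitrary. It accumulates entries across polls because, as far as I know, driver.get_log() returns each entry only once.

```python
import time


def collect_logs_until(get_logs, is_match, timeout=15.0, interval=0.5):
    """Poll get_logs() until some entry satisfies is_match or timeout passes.

    Returns every entry seen so far, whether or not a match was found.
    """
    collected = []
    deadline = time.monotonic() + timeout
    while True:
        batch = get_logs()
        collected.extend(batch)
        if any(is_match(entry) for entry in batch):
            return collected
        if time.monotonic() >= deadline:
            return collected
        time.sleep(interval)


# Possible usage with the driver from the answer (assumed, not run here):
# logs = collect_logs_until(
#     lambda: driver.get_log("performance"),
#     lambda entry: "business/search" in entry["message"],
# )
```

A matching Network.responseReceived entry only means the response headers arrived, so a short extra wait before calling Network.getResponseBody may still be needed.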
Answered By - Yaroslavm