Issue
I am trying to get a list of the locations from https://www.taylorwimpey.co.uk/sitemap. The page opens fine in my browser, but when I crawl it with Scrapy I get nothing and the following log:
2022-04-30 11:49:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-30 11:49:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.taylorwimpey.co.uk/sitemap> (referer: None)
2022-04-30 11:49:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.taylorwimpey.co.uk/sitemap>: HTTP status code is not handled or not allowed
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Closing spider (finished)
Starting csv blank line cleaning
2022-04-30 11:49:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2020,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 2.297067,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 30, 10, 49, 22, 111984),
'httpcompression/response_bytes': 3932,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 30, 10, 49, 19, 814917)}
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Spider closed (finished)
I have tried making adjustments in settings.py, such as changing the User-Agent, but nothing has worked so far.
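The kind of settings.py change attempted was along these lines (the header values here are only illustrative, not the exact ones used):

```python
# settings.py (Scrapy project settings)
# A browser-like User-Agent string; the exact value is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
)

# Headers sent with every request unless overridden per-request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}
```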
My code is:
import scrapy
from TaylorWimpey.items import TaylorwimpeyItem
from scrapy.http import TextResponse
from selenium import webdriver


class taylorwimpeySpider(scrapy.Spider):
    name = "taylorwimpey"
    allowed_domains = ["taylorwimpey.co.uk"]
    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except Exception:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response):  # build a list of all locations
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        url_list1 = []
        for href in response1.xpath('//div[@class="content-container"]/ul/li/a/@href'):
            url = response1.urljoin(href.extract())
            url_list1.append(url)
            print(url)
Any views on what to do?
Solution
You are getting a 403 because the website is behind Cloudflare protection:

https://www.taylorwimpey.co.uk/sitemap could be using a CNAME configuration
https://www.taylorwimpey.co.uk/sitemap is using Cloudflare CDN/Proxy!

Scrapy combined with Selenium can't get past it here, but Selenium on its own can handle such cases and get past the protection smoothly.
import time
import pandas as pd

# selenium 4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# options to add as arguments
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
# keep Chrome open after the script finishes
option.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
driver.get('https://www.taylorwimpey.co.uk/sitemap')
time.sleep(2)

URL = []
for url in driver.find_elements(By.XPATH, '//*[@class="content-container"]/ul/li/a'):
    URL.append(url.get_attribute('href'))

df = pd.DataFrame(URL, columns=['Links'])
print(df)
Output:
Links
0 https://www.taylorwimpey.co.uk/new-homes/abera...
1 https://www.taylorwimpey.co.uk/new-homes/aberarth
2 https://www.taylorwimpey.co.uk/new-homes/aberavon
3 https://www.taylorwimpey.co.uk/new-homes/aberdare
4 https://www.taylorwimpey.co.uk/new-homes/aberdeen
... ...
1691 https://www.taylorwimpey.co.uk/new-homes/yateley
1692 https://www.taylorwimpey.co.uk/new-homes/yealm...
1693 https://www.taylorwimpey.co.uk/new-homes/yeovil
1694 https://www.taylorwimpey.co.uk/new-homes/york
1695 https://www.taylorwimpey.co.uk/new-homes/ystra...
[1696 rows x 1 columns]
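Since your log mentions a CSV cleaning step, the DataFrame can be written straight to a CSV file; this sketch uses a short hard-coded list in place of the scraped `URL` list, and the output filename is hypothetical:

```python
import pandas as pd

# Stand-in for the scraped list of hrefs built in the snippet above.
URL = [
    "https://www.taylorwimpey.co.uk/new-homes/aberarth",
    "https://www.taylorwimpey.co.uk/new-homes/york",
]

df = pd.DataFrame(URL, columns=["Links"])
# index=False keeps the row numbers out of the file; filename is illustrative.
df.to_csv("locations.csv", index=False)
```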
Answered By - F.Hoque