Issue
I am trying to get a list of the locations from https://www.taylorwimpey.co.uk/sitemap. The page opens fine in my browser, but when I crawl it with Scrapy I get nothing and the following log:
2022-04-30 11:49:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-30 11:49:22 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.taylorwimpey.co.uk/sitemap> (referer: None)
2022-04-30 11:49:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.taylorwimpey.co.uk/sitemap>: HTTP status code is not handled or not allowed
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Closing spider (finished)
Starting csv blank line cleaning
2022-04-30 11:49:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 233,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2020,
'downloader/response_count': 1,
'downloader/response_status_count/403': 1,
'elapsed_time_seconds': 2.297067,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 4, 30, 10, 49, 22, 111984),
'httpcompression/response_bytes': 3932,
'httpcompression/response_count': 1,
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/403': 1,
'log_count/DEBUG': 6,
'log_count/INFO': 11,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 4, 30, 10, 49, 19, 814917)}
2022-04-30 11:49:22 [scrapy.core.engine] INFO: Spider closed (finished)
I have tried making adjustments in settings.py, such as changing the User-Agent, but nothing has worked so far.
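The kind of settings.py change attempted was along these lines (the header values here are only illustrative, not the exact ones used):

```python
# settings.py (Scrapy project settings)
# A browser-like User-Agent string; the exact value is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.75 Safari/537.36"
)

# Headers sent with every request unless overridden per-request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}
```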
My code is:
import scrapy
from TaylorWimpey.items import TaylorwimpeyItem
from scrapy.http import TextResponse
from selenium import webdriver


class taylorwimpeySpider(scrapy.Spider):
    name = "taylorwimpey"
    allowed_domains = ["taylorwimpey.co.uk"]
    start_urls = ["https://www.taylorwimpey.co.uk/sitemap"]

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except Exception:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse(self, response):  # build a list of all locations
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        url_list1 = []
        for href in response1.xpath('//div[@class="content-container"]/ul/li/a/@href'):
            url = response1.urljoin(href.extract())
            url_list1.append(url)
            print(url)
Any views on what to do?
Solution
You are getting a 403 because the website is behind Cloudflare protection:

https://www.taylorwimpey.co.uk/sitemap could be using a CNAME configuration
https://www.taylorwimpey.co.uk/sitemap is using Cloudflare CDN/Proxy!

Scrapy combined with Selenium can't get past it here, but Selenium on its own can handle such cases and get past the protection smoothly.
import time
import pandas as pd

# selenium 4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# options to add as arguments
option = webdriver.ChromeOptions()
option.add_argument("start-maximized")
# keep Chrome open after the script finishes
option.add_experimental_option("detach", True)

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=option)
driver.get('https://www.taylorwimpey.co.uk/sitemap')
time.sleep(2)

URL = []
for url in driver.find_elements(By.XPATH, '//*[@class="content-container"]/ul/li/a'):
    URL.append(url.get_attribute('href'))

df = pd.DataFrame(URL, columns=['Links'])
print(df)
Output:
Links
0 https://www.taylorwimpey.co.uk/new-homes/abera...
1 https://www.taylorwimpey.co.uk/new-homes/aberarth
2 https://www.taylorwimpey.co.uk/new-homes/aberavon
3 https://www.taylorwimpey.co.uk/new-homes/aberdare
4 https://www.taylorwimpey.co.uk/new-homes/aberdeen
... ...
1691 https://www.taylorwimpey.co.uk/new-homes/yateley
1692 https://www.taylorwimpey.co.uk/new-homes/yealm...
1693 https://www.taylorwimpey.co.uk/new-homes/yeovil
1694 https://www.taylorwimpey.co.uk/new-homes/york
1695 https://www.taylorwimpey.co.uk/new-homes/ystra...
[1696 rows x 1 columns]
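Since your log mentions a CSV cleaning step, the DataFrame can be written straight to a CSV file; this sketch uses a short hard-coded list in place of the scraped `URL` list, and the output filename is hypothetical:

```python
import pandas as pd

# Stand-in for the scraped list of hrefs built in the snippet above.
URL = [
    "https://www.taylorwimpey.co.uk/new-homes/aberarth",
    "https://www.taylorwimpey.co.uk/new-homes/york",
]

df = pd.DataFrame(URL, columns=["Links"])
# index=False keeps the row numbers out of the file; filename is illustrative.
df.to_csv("locations.csv", index=False)
```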
Answered By - F.Hoque