Monday, December 18, 2023

[FIXED] Scrapy crawler, 403 error when crawling South Wales courses

Labels: python, scrapy, web-scraping

Issue

I have been bashing my head against this for a while and figured I would turn it over to the experts of the internet for a bit of aid.

I am trying to use Scrapy to crawl a list of courses from the University of South Wales (all public information, of course). However, every request is met with a 403 response, which stops me from getting any information.

Here is my spider code:

import scrapy


class CrawlingSpider(scrapy.Spider):
    name = "southwalescrawler"
    start_urls = ["https://www.southwales.ac.uk/courses/"]
    download_delay = 2

    def parse(self, response):
        pass

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/58.0.3029.110 Safari/537.3',
            'Referer': 'https://www.southwales.ac.uk/'
        }
        cookies = {'cookie_name': 'cookie_value'}
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers, cookies=cookies, callback=self.parse)

You'll see that I am sending cookies, delaying requests, and setting the User-Agent and Referer headers. In spite of that, here is the result I get:

2023-12-15 11:51:45 [scrapy.core.engine] INFO: Spider opened
2023-12-15 11:51:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-15 11:51:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-12-15 11:51:45 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/robots.txt> (referer: None)
2023-12-15 11:51:45 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-12-15 11:51:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/courses/> (referer: https://www.southwales.ac.uk/)
2023-12-15 11:51:48 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.southwales.ac.uk/courses/>: HTTP status code is not handled or not allowed
2023-12-15 11:51:48 [scrapy.core.engine] INFO: Closing spider (finished)
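
One detail worth pulling out of the log above is that the robots.txt request (referer: None) is rejected with a 403 as well, not only the courses page. On a longer crawl log, a small helper can list the status, URL, and referer of every response. This is only a sketch using the standard library; the regular expression is an assumption based on the "Crawled" lines shown above, and the names summarize_crawl_log and SAMPLE_LOG are hypothetical.

```python
import re

# Matches Scrapy engine lines of the form seen in the log above, e.g.
#   ... DEBUG: Crawled (403) <GET https://example.com/> (referer: None)
# The pattern is inferred from those lines, not taken from Scrapy itself.
CRAWLED_RE = re.compile(
    r"Crawled \((?P<status>\d{3})\) <(?P<method>[A-Z]+) (?P<url>\S+)> "
    r"\(referer: (?P<referer>[^)]*)\)"
)


def summarize_crawl_log(log_text):
    """Return a (status, url, referer) tuple for every 'Crawled' line."""
    results = []
    for line in log_text.splitlines():
        match = CRAWLED_RE.search(line)
        if match is None:
            continue  # not a per-response line (e.g. the logstats summary)
        referer = match.group("referer")
        results.append((
            int(match.group("status")),
            match.group("url"),
            None if referer == "None" else referer,
        ))
    return results


# The two response lines from the log in the question.
SAMPLE_LOG = """\
2023-12-15 11:51:45 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/robots.txt> (referer: None)
2023-12-15 11:51:48 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.southwales.ac.uk/courses/> (referer: https://www.southwales.ac.uk/)
"""

if __name__ == "__main__":
    for status, url, referer in summarize_crawl_log(SAMPLE_LOG):
        print(status, url, referer)
```

Both entries come back with status 403, which suggests the block is not specific to the courses page.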

Solution

I don't know whether anyone who finds this later will be hoping for an answer about how to use Scrapy on a site like this, but I solved the issue by dropping Scrapy and using Selenium to write a scraper that collects just the course information I was after. The browser had to run non-headless to get through the site's security, but at least it's fun to watch it execute.
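
The answer does not include its code, so the following is only a rough sketch of what a non-headless Selenium approach might look like, not the script that was actually used. It assumes Selenium 4 and a local Chrome install. The helper names, the fixed five-second wait, and the rule that course pages are links containing "/courses/" are all guesses made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

COURSES_URL = "https://www.southwales.ac.uk/courses/"


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_course_links(html, base_url=COURSES_URL):
    """Return absolute, de-duplicated links that contain '/courses/'.

    The '/courses/' filter is an assumption about the site's URL layout.
    """
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if "/courses/" in absolute and absolute != base_url and absolute not in links:
            links.append(absolute)
    return links


def fetch_rendered_html(url, wait_seconds=5):
    """Load url in a visible Chrome window and return the rendered HTML."""
    # Imported lazily so the parsing helper works without Selenium installed.
    import time
    from selenium import webdriver

    driver = webdriver.Chrome()  # a visible (non-headless) window by default
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for the page to settle
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


if __name__ == "__main__":
    for link in extract_course_links(fetch_rendered_html(COURSES_URL)):
        print(link)
```

Keeping the link extraction separate from the browser code means it can be tried on saved HTML without launching Chrome.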

Answered By - Kron

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
