Issue
I am trying to scrape articles from this website. My approach:
- open the main URL
- follow each sub-URL that holds the complete article
- extract all the details I need from the complete article
My first run returned a 403 response, so, following some answers I read, I added headers to the requests for the start_urls. That got me past the main page, but the script now fails with a 403 response when following the sub-URLs that contain the information I need.
My current code is below:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class climateupdate(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

    def parse(self, response):
        for link in response.xpath('//*[@id="content"]/ul/li[1]/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').extract(),
            'title': response.xpath('//*[@id="updates"]/div[1]/h1/text()').get(),
            'text': ''.join([x.get().strip() for x in response.xpath('//*[@class="key-points box-notice bg-grey"]//p//text()')])
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(climateupdate)
    process.start()
How should I write my script so that it follows the sub-URLs and extracts all the details about the articles?
Thank you in advance.
Solution
Passing headers in start_requests is not the right fix here; that is why you are still getting a 403. Inject the user-agent via the spider's custom_settings instead. Your XPath expressions for date and text are also incorrect, and //*[@id="content"]/ul/li[1]/a/@href selects only a single detail URL instead of all of them.
Full working code:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess

class climateupdate(scrapy.Spider):
    name = 'climateupdate'
    start_urls = ['http://www.bom.gov.au/climate/updates/']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        for link in response.xpath('//*[@class="list-archive"]/li/a/@href'):
            yield response.follow(
                url=link.get(),
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            'date': response.xpath('//*[@id="updates"]/p[1]/time/text()').get(),
            'title': ''.join(response.xpath('//*[@id="updates"]//h1//text()').getall()).strip(),
            'text': ''.join(response.xpath('//*[@id="updates"]//p//text()').getall()[1:]).strip()
        }

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(climateupdate)
    process.start()
Answered By - F.Hoque