Issue
I'm currently scraping https://www.carsales.com.au/cars/results
This site uses a cookie ('datadome') that expires after some time; once it does, every response comes back 403 until the crawl stops. I'm currently using JOBDIR in settings.py to persist data between crawls. Once I update the cookie and start the crawler again, the pages that returned 403 are skipped because those requests were already made and are caught by the duplicate filter.
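For reference, the relevant part of my settings.py is roughly this (the path is just a placeholder):
# settings.py
# persist the scheduler queue and dupefilter state between runs
JOBDIR = 'crawls/carsales-1'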
Is there a way to set dont_filter once I get the response?
I've tried the following in a downloader middleware, with no luck:
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
    return response
Manipulating the dupefilter's set of seen requests seems like an option too, but I can't find any hints on how to do that.
Thanks in advance.
Solution
I'm not sure I understand your use case, but to answer your question: you can reschedule a request from a downloader middleware. Make sure the middleware's priority is high in your settings, and in process_response return a new, modified request:
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
        return request
    return response
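To enable the middleware you would register it in settings.py with something like the sketch below; the module path and class name ExpiredCookieRetryMiddleware are placeholders for wherever your middleware actually lives, and 543 is an arbitrary priority value:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ExpiredCookieRetryMiddleware': 543,
}
Because the rescheduled request carries dont_filter=True, it bypasses the dupefilter, so it gets retried even though its fingerprint is already recorded in your JOBDIR state.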
As per the documentation, if process_response returns a request, that request will be rescheduled; if you return a response instead, it will continue to be processed through the middlewares and end up in your callback.
If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
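As a sketch of how this could tie back to your expired-cookie problem: assuming you have some way of obtaining a fresh datadome cookie (refresh_datadome_cookie below is a hypothetical helper, not part of Scrapy), you could attach it to the retried request via Request.replace:
def process_response(self, request, response, spider):
    if response.status == 403:
        # refresh_datadome_cookie() is a hypothetical helper you would implement
        new_cookie = refresh_datadome_cookie()
        # reschedule a copy of the request that skips the dupefilter
        # and carries the fresh cookie
        return request.replace(
            dont_filter=True,
            cookies={'datadome': new_cookie},
        )
    return response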
Answered By - Granitosaurus