Issue
I'm currently scraping https://www.carsales.com.au/cars/results
This site uses a cookie ('datadome') that expires after some time; once it does, every response comes back 403 until the crawl stops. I'm currently using JOBDIR in settings.py to persist data between crawls. Once I update the cookie and start the crawler again, the pages that returned 403 are skipped because those requests were already made and are caught by the duplicate filter.
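For reference, the relevant part of my settings.py is roughly this (the path is just a placeholder):
# settings.py
# persist the scheduler queue and dupefilter state between runs
JOBDIR = 'crawls/carsales-1'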
Is there a way to set dont_filter once I get the response?
I've tried the following in a downloader middleware, with no luck:
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
    return response
Manipulating the dupefilter's set of seen requests seems like an option too, but I can't find any hints on how to do that.
Thanks in advance.
Solution
I'm not sure I understand your use case, but to answer your question: you can reschedule a request from a downloader middleware. Make sure the middleware's priority is high in your settings, and in process_response return a new, modified request:
def process_response(self, request, response, spider):
    if response.status == 403:
        print(request.url, "expired cookie")
        request.dont_filter = True
        return request
    return response
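To enable the middleware you would register it in settings.py with something like the sketch below; the module path and class name ExpiredCookieRetryMiddleware are placeholders for wherever your middleware actually lives, and 543 is an arbitrary priority value:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ExpiredCookieRetryMiddleware': 543,
}
Because the rescheduled request carries dont_filter=True, it bypasses the dupefilter, so it gets retried even though its fingerprint is already recorded in your JOBDIR state.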
As per the documentation, if process_response returns a request, that request will be rescheduled; if you return a response instead, it will continue to be processed through the middlewares and end up in your callback.
If it returns a Response (it could be the same given response, or a brand-new one), that response will continue to be processed with the process_response() of the next middleware in the chain.
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
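As a sketch of how this could tie back to your expired-cookie problem: assuming you have some way of obtaining a fresh datadome cookie (refresh_datadome_cookie below is a hypothetical helper, not part of Scrapy), you could attach it to the retried request via Request.replace:
def process_response(self, request, response, spider):
    if response.status == 403:
        # refresh_datadome_cookie() is a hypothetical helper you would implement
        new_cookie = refresh_datadome_cookie()
        # reschedule a copy of the request that skips the dupefilter
        # and carries the fresh cookie
        return request.replace(
            dont_filter=True,
            cookies={'datadome': new_cookie},
        )
    return response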
Answered By - Granitosaurus