Issue
I'm working with the CrawlSpider class to crawl a website and I would like to modify the headers that are sent in each request. Specifically, I would like to add the referer to the request.
As per this question, I checked response.request.headers.get('Referer', None) in my response parsing function, and the Referer header is not present. I assume that means the Referer is not being sent with the request (unless the website doesn't return it; I'm not sure about that).
I haven't been able to figure out how to modify the headers of a request. Again, my spider is derived from CrawlSpider. Overriding CrawlSpider's _requests_to_follow or specifying a process_request callback for a rule will not work because the referer is not in scope at those times.
Does anyone know how to modify request headers dynamically?
Solution
You have to enable the spider middleware that populates the Referer header of each request from the URL of the response that generated it. See the documentation for scrapy.contrib.spidermiddleware.referer.RefererMiddleware.
In short, you need to add this middleware to your project's settings file.
SPIDER_MIDDLEWARES = {
    # The value is the middleware's order; 700 is the slot Scrapy's defaults use for RefererMiddleware.
    'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
}
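The scrapy.contrib package was deprecated in Scrapy 1.0 and later removed, so on a current release the same setting would presumably use the scrapy.spidermiddlewares path instead. A sketch of the equivalent settings fragment, assuming Scrapy 1.0 or later and an order of 700 (the value Scrapy's built-in defaults assign to this middleware):

```python
# settings.py (sketch, assuming Scrapy 1.0 or later)
SPIDER_MIDDLEWARES = {
    # Assumed order: 700, the slot Scrapy's built-in defaults give this middleware.
    "scrapy.spidermiddlewares.referer.RefererMiddleware": 700,
}

# REFERER_ENABLED is the setting that switches the middleware on or off.
REFERER_ENABLED = True
```

Requests produced from start URLs have no parent response, so they would still carry no Referer header even with the middleware active.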
Then, in your response parsing method, you can use response.request.headers.get('Referer', None) to get the referer. Note that the header name is spelled Referer, with a single "r" in the middle.
Answered By - CatShoes
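For setting the header yourself rather than relying on the middleware: as far as I can tell, from Scrapy 2.0 onward the process_request callback of a Rule also receives the response the link was extracted from, so the referring URL is in scope there. A minimal sketch under that assumption follows. The function name set_referer and the parse_item callback in the comment are hypothetical, and the function only relies on request.headers being dict-like and response.url being a string.

```python
def set_referer(request, response):
    """Hypothetical Rule.process_request callback (assumes the Scrapy 2.0+ signature)."""
    # The response is the page the link was found on, so its URL is the referer.
    request.headers["Referer"] = response.url
    # Returning the request tells the crawler to keep following it.
    return request


# Wiring it into a CrawlSpider (illustrative only, requires Scrapy):
#
#   rules = (
#       Rule(LinkExtractor(), callback="parse_item", follow=True,
#            process_request=set_referer),
#   )
```

The same callback could set any other header that depends on the originating page, since both the outgoing request and its parent response are available at that point.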