Issue
I am crawling a website with Scrapy and I receive this error:
Gave up retrying <GET https://www.something.net> (failed 3 times): 500 Internal Server Error
even though I have added this parameter to the meta of the scrapy.Request that calls the parse function:

    "handle_httpstatus_all": True,
Then in the parse function I do:

    def parse(self, response):
        item = response.meta['item']
        if response.status == 200:
            # ...keeps building the item...
            yield item
So in theory this should not happen. What can I do to avoid it?
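For completeness, the request that triggers parse is created roughly like this (URL and item fields simplified):

    yield scrapy.Request(
        "https://www.something.net",
        callback=self.parse,
        meta={
            "handle_httpstatus_all": True,
            "item": item,
        },
    )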
Solution
Your theory is missing some vital information.
Scrapy has two different sets of middleware that every request and response must pass through. The one you are referring to is the HttpErrorMiddleware, which belongs to the spider middleware group. If this middleware is enabled and you set the request meta key handle_httpstatus_all to True, then it does in fact allow all failed responses through to be parsed.
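(If you prefer to configure this project-wide rather than per request, the same middleware honours the HTTPERROR_ALLOW_ALL setting; a minimal sketch:)

    # settings.py
    # Equivalent to setting handle_httpstatus_all on every request:
    HTTPERROR_ALLOW_ALL = True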
However, there is another group of middleware, the downloader middleware, which requests and responses pass through before they ever reach the spider middleware. Among these is the RetryMiddleware, which identifies responses with certain error codes that are considered potentially temporary, and automatically resends those requests up to a certain number of times before the response is officially considered failed.
So your theory is still accurate in the sense that all failed responses are eventually passed to your callback, but for some error codes they first go through a few retry attempts before that happens.
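For reference, the relevant defaults look like this in recent Scrapy versions (double-check against the documentation for your version):

    # settings.py defaults (recent Scrapy versions):
    RETRY_ENABLED = True
    RETRY_TIMES = 2  # 2 retries + the original attempt = "failed 3 times"
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]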
You can customize the middleware's behavior by setting the max_retry_times meta key to a custom number of retries, by setting the dont_retry meta key to True, or by disabling the retry middleware altogether in your settings with RETRY_ENABLED = False.
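For example, either of these meta keys on the request from the question stops the retries (a sketch):

    meta={
        "handle_httpstatus_all": True,
        "item": item,
        "dont_retry": True,        # skip RetryMiddleware for this request
        # or instead: "max_retry_times": 0,
    }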
You can also customize which error codes are considered eligible for retry with the RETRY_HTTP_CODES setting.
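For instance, to let 500 responses through to your callback immediately while still retrying the other transient codes (list assumed from the current defaults):

    # settings.py
    RETRY_HTTP_CODES = [502, 503, 504, 522, 524, 408, 429]  # 500 removed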
Answered By - Alexander