Issue
I've created a script using scrapy to parse content from a website. The script works fine. However, I want the spider to retry when the URL it requests gets redirected (to a captcha page), which is why I created a retry middleware.
I am trying to understand why the "or response" part is there in process_response(), in this line:
return self._retry(request, reason, spider) or response
I want this method to retry, not to return the response from within that block.
This is my current approach:
def _retry(self, request, spider):
    check_url = request.url
    r = request.copy()
    r.dont_filter = True
    return r

def process_response(self, request, response, spider):
    if ("some_redirected_url" in response.url) and (response.status in RETRY_HTTP_CODES):
        return self._retry(request, spider) or response
    return response
Solution
In this case, return x or y is a nice little shortcut for:
if x:
    return x
else:
    return y
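The shortcut can be checked in isolation. This is a minimal sketch; the function name first_truthy and the string values are made up for illustration and are not part of Scrapy:

```python
def first_truthy(x, y):
    # Same behaviour as: if x: return x / else: return y
    return x or y


# None is falsy, so the second operand is returned.
fallback = first_truthy(None, "response")

# A non-empty string is truthy, so the first operand is returned.
preferred = first_truthy("retry request", "response")

print(fallback)
print(preferred)
```

Note that or returns one of its operands unchanged, not a bool, which is why it works as a "use this, otherwise fall back to that" expression.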
In the standard RetryMiddleware, the _retry method has two branches:
if retries <= retry_times:
    ...
    return retryreq
else:
    ...
The else branch doesn't return anything, and if a method reaches the end without returning, None is returned implicitly. This means that the line
return self._retry(request, reason, spider) or response
evaluates to
return None or response
and, as bool(None) is False, response will be returned in this case. If, on the other hand, retry_times hasn't been exceeded, _retry will return retryreq, which is truthy, and that will be returned from process_response instead.
In your code, _retry always returns a Request (the copy of the original request), which is always truthy, so the or response part will never be reached.
Answered By - tomjn