Issue
I am trying to find a setting for my Scrapy spider to handle the following situations:
- a power failure in the middle of a crawl
- my ISP connection going down
The behaviour I am expecting is that Scrapy should not give up; rather, it should wait indefinitely for the connection to be restored and continue scraping, retrying the request after a brief pause or interval of about 10 seconds.
This is the error message that I get when my internet goes down:
https://example.com/1.html
2022-10-21 17:44:14 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.example.com/1.html> (failed 1 times): An error occurred while connecting: 10065: A socket operation was attempted to an unreachable host..
And the message repeats.
What I am afraid of is that by the time the blip is over, Scrapy will have given up on 1.html and moved on to another URL, say 99.html.
My question is: when the "socket operation was attempted to an unreachable host" error occurs, how do I make Scrapy wait and retry the same URL, https://www.example.com/1.html?
Thanks in advance.
Solution
There is no built-in setting that will do this, but it can still be implemented fairly easily.
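For context, the closest built-in knobs are the retry settings used by Scrapy's RetryMiddleware, but they only retry a failed request a fixed number of times; they cannot make the crawl wait indefinitely. A minimal settings.py sketch for comparison (the values are illustrative, not a fix for this problem):

import scrapy  # noqa: F401  (settings.py is plain Python; no imports are actually required)

# built-in retry settings, shown only for comparison;
# RETRY_TIMES is always finite, so this alone cannot "wait forever"
RETRY_ENABLED = True
RETRY_TIMES = 5        # extra attempts per failed request
DOWNLOAD_DELAY = 10    # seconds between consecutive requests, not a retry back-off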
The approach that seems most straightforward to me is to catch the response_received signal in your spider and check for the specific error code you receive when your ISP goes down. When that happens you can pause the Scrapy engine, wait for as long as you like, and then retry the same request, repeating until it succeeds.
For example:
import time

from scrapy import Spider
from scrapy.signals import response_received


class MySpider(Spider):
    ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # listen for the response_received signal and call check_response
        crawler.signals.connect(spider.check_response, signal=response_received)
        return spider

    def check_response(self, response, request, spider):
        engine = spider.crawler.engine
        if response.status == 404:          # <- your error code goes here
            engine.pause()
            time.sleep(600)                 # <- wait 10 minutes
            request.dont_filter = True      # <- tell the engine not to filter this request
            engine.unpause()
            engine.crawl(request.copy())    # <- resend the request
Update
Since it isn't an HTTP error code that you are receiving, the next best solution is to create a custom DownloaderMiddleware that catches the relevant exceptions and then does essentially the same thing as the first example.
In your middlewares.py file:
import time

from twisted.internet.error import (ConnectionRefusedError, ConnectionDone,
                                    ConnectError, ConnectionLost,
                                    TimeoutError, DNSLookupError)


class ConnectionLostPauseDownloadMiddleware:

    def __init__(self, settings, crawler):
        self.crawler = crawler
        self.exceptions = (ConnectionRefusedError, ConnectionDone, ConnectError, ConnectionLost)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings, crawler)

    def process_exception(self, request, exception, spider):
        if isinstance(exception, self.exceptions):
            new_request = request.copy()
            new_request.dont_filter = True
            self.crawler.engine.pause()
            time.sleep(60 * 10)    # <- wait 10 minutes
            self.crawler.engine.unpause()
            return new_request     # <- the engine reschedules the returned request
Then in your settings.py:
DOWNLOADER_MIDDLEWARES = {
    'MyProjectName.middlewares.ConnectionLostPauseDownloadMiddleware': 543,
}
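One optional refinement, not part of the original answer: instead of hard-coding the ten-minute sleep, the pause length could be read from the settings object that the middleware already receives in __init__. The setting name CONNECTION_LOST_PAUSE_SECONDS below is made up purely for illustration; settings.getint() is a standard Scrapy Settings method.

# settings.py -- hypothetical custom setting for the pause length
CONNECTION_LOST_PAUSE_SECONDS = 600  # ten minutes

# in the middleware's __init__ (sketch):
#     self.pause_seconds = settings.getint('CONNECTION_LOST_PAUSE_SECONDS', 60 * 10)
# and in process_exception, replace the hard-coded sleep with:
#     time.sleep(self.pause_seconds)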
Answered By - Alexander