Issue
I am scraping a news website with a spider that extracts news data and dumps it to MongoDB.
My spider is defined with the following rule:
rules = [Rule(
    LinkExtractor(
        allow=["foo.tv/en/*",
               "https://fooports.tv/*"]  # only such URLs
    )
)]
What I currently do is fetch the already-scraped URLs from the database and skip any URL that is found there, e.g.:
urls_visited = get_visited_urls()  # Fetches from MongoDB
if response.url not in urls_visited:
    # do scraping here
What I am looking for is a way to make the spider skip URLs that have already been scraped, so the crawling time is reduced by not revisiting pages that have already been processed. I know there is a deny option in the Rule, but I am not sure how to make use of it in this case.
I have added a custom downloader middleware class to filter out requests that have already been scraped:
import logging

from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class NewsCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        self.urls_visited = get_visited_urls()  # from database

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Here we check if the url has already been scraped;
        # if not, process the request
        if request.url in self.urls_visited:
            logging.info('ignoring url %s', request.url)
            raise IgnoreRequest()
        else:
            return request
My middleware order in settings.py:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'news_crawler.middlewares.NewsCrawlerDownloaderMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
However, when it tries to crawl the first URL it gives me the following error:
ERROR: Error downloading <GET https://arynews.tv/robots.txt>: maximum recursion depth exceeded while calling a Python object
Any ideas how I can properly use my custom downloader middleware to filter out the URLs?
Solution
You can create a downloader middleware that filters requests based on your database queries. Check out the documentation.
In this case you need to define a class with a process_request(request, spider) method and enable this middleware in your settings (how you do that depends on whether you launch your spider via the CLI or from within a Python script).
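A minimal sketch of such a middleware is given below, assuming the get_visited_urls() helper from the question returns the set of already-scraped URLs from MongoDB. The important detail is that process_request() returns None for requests that should proceed; returning the request object itself (as in the code in the question) makes Scrapy reschedule it and run the middleware chain again, which is the likely source of the recursion error.

import logging

from scrapy.exceptions import IgnoreRequest


class NewsCrawlerDownloaderMiddleware:
    def __init__(self):
        # Load the already-scraped URLs once, when the middleware is created.
        self.urls_visited = set(get_visited_urls())

    def process_request(self, request, spider):
        if request.url in self.urls_visited:
            logging.info('ignoring url %s', request.url)
            raise IgnoreRequest()
        # Returning None lets the request continue through the remaining
        # downloader middlewares and on to the downloader.
        return None

To enable it, register the class in DOWNLOADER_MIDDLEWARES (the priority 543 is only the template default, not a required value):

DOWNLOADER_MIDDLEWARES = {
    'news_crawler.middlewares.NewsCrawlerDownloaderMiddleware': 543,
}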
Alternatively, you can define your own duplicate filter; take a look at dupefilters.py. This might be a slightly more complicated approach, though, as it requires some understanding of and experience with Scrapy.
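Here is a rough sketch of that approach, not a drop-in implementation: it subclasses Scrapy's default RFPDupeFilter and additionally treats URLs already stored in MongoDB (via the same hypothetical get_visited_urls() helper) as already seen. The class name and module path are just examples.

from scrapy.dupefilters import RFPDupeFilter


class MongoAwareDupeFilter(RFPDupeFilter):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # URLs already present in the database are treated as duplicates.
        self.urls_visited = set(get_visited_urls())

    def request_seen(self, request):
        if request.url in self.urls_visited:
            return True
        # Fall back to the default fingerprint-based de-duplication.
        return super().request_seen(request)

Enable it in settings.py:

DUPEFILTER_CLASS = 'news_crawler.dupefilters.MongoAwareDupeFilter'

With either approach, requests for already-scraped URLs never reach the downloader at all, which is what reduces the crawl time.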
Answered By - Serhii Shynkarenko