Issue
I am scraping a news website with a spider that extracts news data and dumps it to MongoDB.
My spider is defined with the following rule:
rules = [Rule(
    LinkExtractor(
        allow=["foo.tv/en/*",
               "https://fooports.tv/*"]  # only such URLs
    )
)]
What I currently do is fetch the already-scraped URLs from the database and skip any URL that is found there, e.g.:
urls_visited = get_visited_urls()  # Fetches from MongoDB
if response.url not in urls_visited:
    # do scraping here
What I am looking for is a way to make the spider skip URLs that have already been scraped, so the crawling time is reduced by not revisiting pages that have already been processed. I know there is a deny option in the Rule, but I am not sure how to make use of it in this case.
I have added a custom downloader middleware class to filter out requests that have already been scraped:
import logging

from scrapy import signals
from scrapy.exceptions import IgnoreRequest


class NewsCrawlerDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        self.urls_visited = get_visited_urls()  # from database

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        # Here we check if the url has already been scraped;
        # if not, process the request
        if request.url in self.urls_visited:
            logging.info('ignoring url %s', request.url)
            raise IgnoreRequest()
        else:
            return request
My middleware order in settings.py:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'news_crawler.middlewares.NewsCrawlerDownloaderMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
However, when it tries to crawl the first URL it gives me the following error:
ERROR: Error downloading <GET https://arynews.tv/robots.txt>: maximum recursion depth exceeded while calling a Python object
Any ideas how I can properly use my custom downloader middleware to filter out the URLs?
Solution
You can create a downloader middleware that filters requests based on your database queries. Check out the documentation.
In this case you need to define a class with a process_request(request, spider) method and enable this middleware in your settings (how you do that depends on whether you launch your spider via the CLI or from within a Python script).
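A minimal sketch of such a middleware is given below, assuming the get_visited_urls() helper from the question returns the set of already-scraped URLs from MongoDB. The important detail is that process_request() returns None for requests that should proceed; returning the request object itself (as in the code in the question) makes Scrapy reschedule it and run the middleware chain again, which is the likely source of the recursion error.

import logging

from scrapy.exceptions import IgnoreRequest


class NewsCrawlerDownloaderMiddleware:
    def __init__(self):
        # Load the already-scraped URLs once, when the middleware is created.
        self.urls_visited = set(get_visited_urls())

    def process_request(self, request, spider):
        if request.url in self.urls_visited:
            logging.info('ignoring url %s', request.url)
            raise IgnoreRequest()
        # Returning None lets the request continue through the remaining
        # downloader middlewares and on to the downloader.
        return None

To enable it, register the class in DOWNLOADER_MIDDLEWARES (the priority 543 is only the template default, not a required value):

DOWNLOADER_MIDDLEWARES = {
    'news_crawler.middlewares.NewsCrawlerDownloaderMiddleware': 543,
}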
Alternatively, you can define your own duplicate filter; take a look at dupefilters.py. This might be a slightly more complicated approach, though, as it requires some understanding of and experience with Scrapy.
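Here is a rough sketch of that approach, not a drop-in implementation: it subclasses Scrapy's default RFPDupeFilter and additionally treats URLs already stored in MongoDB (via the same hypothetical get_visited_urls() helper) as already seen. The class name and module path are just examples.

from scrapy.dupefilters import RFPDupeFilter


class MongoAwareDupeFilter(RFPDupeFilter):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # URLs already present in the database are treated as duplicates.
        self.urls_visited = set(get_visited_urls())

    def request_seen(self, request):
        if request.url in self.urls_visited:
            return True
        # Fall back to the default fingerprint-based de-duplication.
        return super().request_seen(request)

Enable it in settings.py:

DUPEFILTER_CLASS = 'news_crawler.dupefilters.MongoAwareDupeFilter'

With either approach, requests for already-scraped URLs never reach the downloader at all, which is what reduces the crawl time.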
Answered By - Serhii Shynkarenko