Issue
I have a problem with Scrapy's CrawlSpider: it doesn't quit, as it is supposed to, when a CloseSpider exception is raised. Below is the code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
import re


class RecursiveSpider(CrawlSpider):
    name = 'recursive_spider'
    start_urls = ['https://www.webiste.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    miss = 0
    hits = 0

    def quit(self):
        print("ABOUT TO QUIT")
        raise CloseSpider('limits_exceeded')

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        try:
            match = re.search(r"[A-za-z]{0,1}edical[a-z]{2}", response.body_as_unicode()).group(0)
        except:
            match = 'NOTHING'

        print("\n")
        print("\n")
        print("\n")
        print("****************************************INFO****************************************")

        if "string" in item['url']:
            print(item['url'])
            print(match)
            print(self.hits)
            self.hits += 10
            if self.hits > 10:
                print("HITS EXCEEDED")
                self.quit()
        else:
            self.miss += 1
            print(self.miss)
            if self.miss > 10:
                print("MISS EXCEEDED")
                self.quit()

        print("\n")
        print("\n")
        print("\n")
The problem is that, although I can see the conditions being entered and the exception raised in the log, the crawler keeps crawling. I run it with:
scrapy crawl recursive_spider
Solution
I'm going to guess this is a case of Scrapy just taking a while to shut down rather than actually ignoring the exception. The engine will not exit until it has worked through all of the requests it has already scheduled or sent, so I suggest lowering the CONCURRENT_REQUESTS / CONCURRENT_REQUESTS_PER_DOMAIN settings to see if that works for you.
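As a rough sketch of how that could look for this spider only (the values below are illustrative, not taken from the original post), the custom_settings class attribute overrides the project-wide settings:

from scrapy.spiders import CrawlSpider

class RecursiveSpider(CrawlSpider):
    name = 'recursive_spider'

    # Per-spider overrides of the project settings; the values here are
    # only an example. Fewer concurrent requests means fewer in-flight
    # requests for the engine to drain once CloseSpider is raised, so the
    # spider should stop sooner.
    custom_settings = {
        'CONCURRENT_REQUESTS': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    }

Alternatively, the same settings can be passed on the command line for a single run, e.g. scrapy crawl recursive_spider -s CONCURRENT_REQUESTS=2, or set in the project's settings.py.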
Answered By - John Smith