Thursday, June 30, 2022

[FIXED] Scrapy concurrent spiders instance variables


Issue

I have a number of Scrapy spiders running and recently hit a strange bug. I have a base class and a number of subclasses:

import scrapy

class MyBaseSpider(scrapy.Spider):
    new_items = []

    def spider_closed(self):
        # Email any new items that weren't in the last run
        ...

class MySpiderImpl1(MyBaseSpider):
    def parse(self, response):
        # Implement site-specific checks
        self.new_items.append(new_found_item)

class MySpiderImpl2(MyBaseSpider):
    def parse(self, response):
        # Implement site-specific checks
        self.new_items.append(new_found_item)

This seemed to be running well: new items get emailed to me on a per-site basis. However, I've recently had some emails from MySpiderImpl1 that contain items from Site 2.

I'm following the documentation to run from a script:

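    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    scraper_settings = get_project_settings()
    runner = CrawlerRunner(scraper_settings)
    configure_logging()
    sites = get_spider_names()  # my own helper, not part of Scrapy
    for site in sites:
        runner.crawl(site.spider_name)  # schedule every spider in the same process
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())  # stop the reactor once all crawls finish
    reactor.run()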

I suspect the solution here is to switch to a pipeline that collates the items for a site and emails them out when the pipeline's close_spider is called, but I was surprised to see the new_items variable leaking between spiders.

Is there any documentation on concurrent runs? Is it bad practice to keep variables on a base class? I do also track other pieces of information on the spiders in variables such as the run number - should this be tracked elsewhere?

Solution

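In Python, a class attribute is shared by every instance and subclass until one of them rebinds it. Calling self.new_items.append(...) never rebinds anything; it mutates the list in place, so MyBaseSpider.new_items is the exact same list object as MySpiderImpl1.new_items and MySpiderImpl2.new_items. Because CrawlerRunner runs all of the spiders in one process, they all append to that one list.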
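The sharing described above can be reproduced without Scrapy. The following is a minimal sketch in plain Python; the class names (Base, Impl1, FixedBase and so on) are made up for illustration and stand in for the spiders in the question.

```python
class Base:
    new_items = []  # class attribute: one list object shared by all subclasses


class Impl1(Base):
    pass


class Impl2(Base):
    pass


a, b = Impl1(), Impl2()
a.new_items.append("site1-item")  # mutates the shared list in place
shared = b.new_items              # the Impl2 object sees the Impl1 item
same_object = Impl1.new_items is Impl2.new_items is Base.new_items


class FixedBase:
    def __init__(self):
        self.new_items = []  # instance attribute: a fresh list per object


class FixedImpl1(FixedBase):
    pass


class FixedImpl2(FixedBase):
    pass


c, d = FixedImpl1(), FixedImpl2()
c.new_items.append("site1-item")
isolated = d.new_items  # unaffected by what c collected
```

If the same fix is applied to a real spider's __init__, it should accept *args and **kwargs and call super().__init__(*args, **kwargs) so that Scrapy's own initialisation still runs.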

As you suggested, you could implement a pipeline, although this might require significant refactoring of your current code. It could look something like this:

pipelines.py

class MyPipeline:
    def process_item(self, item, spider):
        if spider.name == 'site1':
            ...  # email item
        elif spider.name == 'site2':
            ...  # do something different
        return item  # hand the item on to any later pipelines

This assumes each of your spiders has a distinct name. Scrapy requires every spider to define one, so that should already be the case.
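The collating pipeline the question proposes could be sketched roughly as follows. This is only an illustration under assumptions: the class name CollectNewItemsPipeline and the send_report callable are hypothetical, and the actual email code is left out. Items are stored on the pipeline instance and keyed by spider name, so nothing lives on a class.

```python
from collections import defaultdict


class CollectNewItemsPipeline:
    """Collects items per spider and reports them when that spider closes."""

    def __init__(self, send_report=None):
        # Instance attribute keyed by spider name; nothing is stored on the class.
        self.items_by_spider = defaultdict(list)
        # send_report is a hypothetical callable(spider_name, items);
        # swap in real email code here. The default does nothing.
        self.send_report = send_report or (lambda name, items: None)

    def process_item(self, item, spider):
        self.items_by_spider[spider.name].append(item)
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # Remove and report only the items that belong to the closing spider.
        items = self.items_by_spider.pop(spider.name, [])
        if items:
            self.send_report(spider.name, items)
```

To use it, the class would be listed in the project's ITEM_PIPELINES setting under whatever module path it lives at in your project.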

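Another option, which probably requires less effort, is to override the start_requests method in your base class so that each spider assigns itself a fresh list at the start of the crawl. Assigning to self.new_items creates an instance attribute that shadows the shared class-level list.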
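A rough sketch of the start_requests approach follows, assuming a Scrapy version in which start_requests() is the spider's entry point. To keep it runnable without Scrapy installed, StandInSpider is a made-up stand-in for scrapy.Spider, and FreshItemsMixin, Site1 and Site2 are illustrative names as well.

```python
class FreshItemsMixin:
    """Gives each spider object its own new_items list when the crawl starts."""

    def start_requests(self):
        # Rebinding on self shadows any class-level list of the same name.
        self.new_items = []
        yield from super().start_requests()


class StandInSpider:
    """Stand-in for scrapy.Spider so the sketch runs without Scrapy."""

    new_items = []  # the shared class-level list from the question

    def start_requests(self):
        yield "request-for-" + self.name


class Site1(FreshItemsMixin, StandInSpider):
    name = "site1"


class Site2(FreshItemsMixin, StandInSpider):
    name = "site2"


s1, s2 = Site1(), Site2()
# Scrapy would consume these generators itself at the start of each crawl.
requests_1 = list(s1.start_requests())
requests_2 = list(s2.start_requests())
s1.new_items.append("site1-item")  # lands only in s1's own list
```

Because start_requests is a generator, the assignment only runs once the generator is first iterated. In this sketch that is the list(...) call; in a real crawl it happens before any parse callback can append to the list.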

Answered By - alexpdev

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
