Issue
As the title suggests, I'm trying to use multiple spiders in Scrapy. One spider, news_spider, works with the command
scrapy crawl news_spider -o news.json
and produces exactly the result I expect.
However, when I try to use the spider quotes_spider with the analogous command
scrapy crawl quotes_spider -o quotes.json
I receive the message "Spider not found: quotes_spider".
And just for some history: I created quotes_spider first, and it was working. I then duplicated it as news_spider and edited the copy, at which point I moved quotes_spider out of the spiders directory. Now that news_spider is working, I have moved quotes_spider back into the spiders directory, and I get the error message above.
The directory tree looks like this:
tutorial
├── news.json
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   ├── items.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── quotes.jl
    ├── quotes.json
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   ├── news_spider.cpython-37.pyc
        │   └── quotes_spider.cpython-37.pyc
        ├── news_spider.py
        └── quotes_spider.py
News Spider:
import scrapy
from scrapy.exporters import JsonLinesItemExporter
from tutorial.items import TutorialItem

# Scrapy Spider
class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"
    allowed_domains = ['benzinga.com/']
    start_urls = [
        'https://www.benzinga.com/top-stories/20/09/17554548/stock-wars-ford-vs-general-motors-vs-tesla'
    ]

    # MY SCRAPY STUFF
    # response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
    def parse(self, response):
        paragraphs = response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
        print(paragraphs)
        for p in paragraphs:
            yield TutorialItem(content=p)
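For reference, the TutorialItem yielded above comes from tutorial/items.py; for this spider to run, it must at minimum declare the content field, roughly like this:
import scrapy

class TutorialItem(scrapy.Item):
    # the only field the news spider populates
    content = scrapy.Field()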
Quotes Spider:
import scrapy
from scrapy.exporters import JsonLinesItemExporter

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    #### Actually don't have to use the start_requests function since it's built in. Can just use start_urls
    # def start_requests(self):
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/'
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/'
    ]

    #### Original parse to just get the entire page
    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = 'quotes-%s.html' % page
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)
    #     self.log('Saved file %s' % filename)

    #### Parse to actually gather targeted info
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").get(),
                'author': quote.css("small.author::text").get(),
                'tags': quote.css("div.tags a.tag::text").getall()
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I have searched SO, and the answers I found regarding multiple spiders all seem to be about running multiple spiders concurrently, which is not what I'm trying to do, so I have not found an answer to why one of these works and the other does not. Can anyone see an error in my code that I might be overlooking?
Solution
The problem is how you are executing it. The name of your quotes spider is "quotes", not "quotes_spider":
class QuotesSpider(scrapy.Spider):
    name = "quotes"
Therefore, the command to run it is:
scrapy crawl quotes -o quotes.json
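If you are ever unsure which names are available, scrapy list (run from inside the project directory) prints the name attribute of every spider Scrapy has discovered. With both files in the spiders directory, it should print something like:
$ scrapy list
news_spider
quotes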
Likewise, the name of your news spider is "news_spider":
class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"
And you execute it with:
scrapy crawl news_spider -o news.json
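Note that scrapy crawl looks spiders up by their name attribute, never by filename, so the mirror-image fix would be to rename the quotes spider instead, if you prefer to keep your original command; only the name line needs to change:
class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"  # now matches: scrapy crawl quotes_spider -o quotes.json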
Answered By - renatodvc