Issue
As the title suggests, I'm trying to use multiple spiders in Scrapy. One spider, news_spider, works with the command
scrapy crawl news_spider -o news.json
and produces exactly the result I expect.
However, when I try to use the spider quotes_spider with the analogous command
scrapy crawl quotes_spider -o quotes.json
I receive the message "Spider not found: quotes_spider".
And just for some history: I created quotes_spider first, and it was working. I then duplicated it as news_spider and edited the copy, at which point I moved quotes_spider out of the spiders directory. Now that news_spider is working, I have moved quotes_spider back into the spiders directory, and I get the error message above.
The directory tree looks like this:
tutorial
├── news.json
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-37.pyc
    │   ├── items.cpython-37.pyc
    │   └── settings.cpython-37.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── quotes.jl
    ├── quotes.json
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-37.pyc
        │   ├── news_spider.cpython-37.pyc
        │   └── quotes_spider.cpython-37.pyc
        ├── news_spider.py
        └── quotes_spider.py
News Spider:
import scrapy
from scrapy.exporters import JsonLinesItemExporter
from tutorial.items import TutorialItem

# Scrapy Spider
class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"
    allowed_domains = ['benzinga.com/']
    start_urls = [
        'https://www.benzinga.com/top-stories/20/09/17554548/stock-wars-ford-vs-general-motors-vs-tesla'
    ]

    # MY SCRAPY STUFF
    # response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
    def parse(self, response):
        paragraphs = response.xpath('//div[@class="article-content-body-only"]/p/text()').extract()
        print(paragraphs)
        for p in paragraphs:
            yield TutorialItem(content=p)
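For reference, the TutorialItem yielded above comes from tutorial/items.py; for this spider to run, it must at minimum declare the content field, roughly like this:
import scrapy

class TutorialItem(scrapy.Item):
    # the only field the news spider populates
    content = scrapy.Field()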
Quotes Spider:
import scrapy
from scrapy.exporters import JsonLinesItemExporter

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    #### Actually don't have to use the start_requests function since it's built in. Can just use start_urls
    # def start_requests(self):
    #     urls = [
    #         'http://quotes.toscrape.com/page/1/',
    #         'http://quotes.toscrape.com/page/2/'
    #     ]
    #     for url in urls:
    #         yield scrapy.Request(url=url, callback=self.parse)

    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/'
    ]

    #### Original parse to just get the entire page
    # def parse(self, response):
    #     page = response.url.split("/")[-2]
    #     filename = 'quotes-%s.html' % page
    #     with open(filename, 'wb') as f:
    #         f.write(response.body)
    #     self.log('Saved file %s' % filename)

    #### Parse to actually gather targeted info
    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").get(),
                'author': quote.css("small.author::text").get(),
                'tags': quote.css("div.tags a.tag::text").getall()
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I have searched SO, and the answers I found regarding multiple spiders all seem to be about running multiple spiders concurrently, which is not what I'm trying to do, so I have not found an answer to why one of these works and the other does not. Can anyone see an error in my code that I might be overlooking?
Solution
The problem is how you are executing it. The name of your quotes spider is "quotes", not "quotes_spider":
class QuotesSpider(scrapy.Spider):
    name = "quotes"
Therefore, the command to run it is:
scrapy crawl quotes -o quotes.json
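If you are ever unsure which names are available, scrapy list (run from inside the project directory) prints the name attribute of every spider Scrapy has discovered. With both files in the spiders directory, it should print something like:
$ scrapy list
news_spider
quotes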
Likewise, the name of your news spider is "news_spider":
class FinNewsSpider(scrapy.Spider):
    # Initializing log file
    # logfile("news_spider.log", maxBytes=1e6, backupCount=3)
    name = "news_spider"
And you execute it with:
scrapy crawl news_spider -o news.json
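Note that scrapy crawl looks spiders up by their name attribute, never by filename, so the mirror-image fix would be to rename the quotes spider instead, if you prefer to keep your original command; only the name line needs to change:
class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"  # now matches: scrapy crawl quotes_spider -o quotes.json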
Answered By - renatodvc