Issue
I am trying to extract some quotes from Goodreads using Scrapy, but I am running into a problem. Here is my code:
import scrapy

start_urls=['https://www.goodreads.com/quotes']
for number in range(1,11):
    start_urls.append('https://www.goodreads.com/{}'.format(str(number)))

class quotes(scrapy.Spider):
    name='goodreads_quotes'

    def start_requests(self):
        urls=start_urls
        for url in urls:
            yield scrapy.Request(url=url,callback=self.parse)

    def parse(self,response):
        quotes=response.css('div .quoteText::text').extract()
        for quote in quotes:
            if len(quote)>10:
                yield quote
Every time I try to run it in scrapy shell, I get the following error:
2020-10-16 21:53:16 [scrapy.core.engine] INFO: Spider opened
2020-10-16 21:53:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0
items (at 0 items/min)
2020-10-16 21:53:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-16 21:53:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET
https://www.goodreads.com/robots.txt> (referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.goodreads.com/quotes>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/7>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/2>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/5>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/3>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/6>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/4>
(referer: None)
2020-10-16 21:53:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.goodreads.com/1>
(referer: None)
2020-10-16 21:53:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404
https://www.goodreads.com/7>: HTTP status code is not handled or not allowed
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.core.scraper] ERROR: Spider must return request, item, or None, got 'str'
in <GET https://www.goodreads.com/quotes>
2020-10-16 21:53:19 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404
https://www.goodreads.com/9>: HTTP status code is not handled or not allowed
2020-10-16 21:53:19 [scrapy.core.engine] INFO: Closing spider (finished)
Does anybody have any suggestions that could help me scrape the site successfully?
Solution
As the error points out, a parse function must return a request, item, or None. It errors because you are trying to return a str. Instead of returning a str, you can solve this by creating a class that inherits from scrapy.Item and holds the data you want:
import scrapy

# Create a scrapy.Item class which will hold all the scraped data
class Quote(scrapy.Item):
    text = scrapy.Field()
    # any additional info you want to put in a quote...

class QuoteSpider(scrapy.Spider):
    ...
    def parse(self, response):
        quotes = response.css('div .quoteText::text').extract()
        for quote in quotes:
            if len(quote) > 10:
                # We yield a Quote scrapy.Item instead of a string!
                yield Quote(text=quote)
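
As a lighter-weight variation on the same idea, Scrapy also accepts a plain dict as an item, so wrapping each string in a dict is enough to satisfy the "request, item, or None" rule. The sketch below is not part of the original answer. The helper names build_page_urls and iter_quote_items are hypothetical, and the ?page=N URL pattern is an assumption about how Goodreads paginates its quotes listing (the log in the question shows that https://www.goodreads.com/1 and similar URLs return 404), so verify it in a browser before relying on it. Note that this version strips whitespace before checking the length, which differs slightly from the original filter.

```python
from typing import Dict, Iterable, Iterator, List


def build_page_urls(last_page: int) -> List[str]:
    """Return listing URLs for pages 1..last_page.

    ASSUMPTION: pages are addressed as /quotes?page=N. This is a guess
    at the site's pagination scheme, not something the question confirms.
    """
    base = "https://www.goodreads.com/quotes"
    return ["{}?page={}".format(base, n) for n in range(1, last_page + 1)]


def iter_quote_items(
    fragments: Iterable[str], min_length: int = 10
) -> Iterator[Dict[str, str]]:
    """Turn raw ::text fragments into dict items, skipping short ones."""
    for fragment in fragments:
        # Strip surrounding whitespace first so blank fragments are dropped.
        text = fragment.strip()
        if len(text) > min_length:
            # A dict is a valid Scrapy item; a bare str is not.
            yield {"text": text}
```

Inside the spider, parse could then be reduced to `yield from iter_quote_items(response.css('div .quoteText::text').extract())`, and start_urls could be set to `build_page_urls(10)` if the assumed URL pattern holds.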
Answered By - Cho'Gath