Issue
I'm new to spiders and have this basic code to scrape Reedsy. I was able to locate and extract the elements I need in the Scrapy shell, but my spider is not working. The error message is so long that I'm not sure what exactly the problem is. Any help would be appreciated!
import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/']

    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            yield {
                'Publisher': response.css('h3.text-heavy::text').get().replace('\n',''),
                'url' : response.css('a.text-blue').attrib['href'],
            }
Solution
The traceback reports AttributeError: 'NoneType' object has no attribute 'replace'. This means that the expression you call .replace('\n','') on is evaluating to None, which doesn't have a replace method.
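The failure can be reproduced without Scrapy at all. The snippet below is a minimal sketch, assuming only that the selector's .get() returned None because nothing matched; the variable names are illustrative and not part of the original spider.

```python
# Stand-in for the result of .get() when the selector matches nothing.
value = None

try:
    # Same post-processing call as in the spider.
    value.replace('\n', '')
except AttributeError as exc:
    # Capture the message that appears at the bottom of the traceback.
    message = str(exc)

print(message)
```

Reading the last line of a long traceback first usually points straight at this kind of message.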
The cause of this error is that the class in your initial selector, div.panel-body, is used in more than one way within the page. It marks each of the containers that hold the information you are trying to extract, but it is also used near the footer of the page, and that footer panel is the part that triggers the error.
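To avoid this, evaluate the selector expressions first, then check whether either result is None. Only when both have a value should you do any post-processing and yield the result. The code that follows also runs the selectors against each panel (publishers.css) rather than the whole response, so every item gets its own values.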
For example:
import scrapy

class PublisherSpider(scrapy.Spider):
    name = 'mycrawler'
    start_urls = ['https://blog.reedsy.com/publishers/african-american/']

    def parse(self, response):
        for publishers in response.css('div.panel-body'):
            # Query relative to the current panel, not the whole response.
            publisher = publishers.css('h3.text-heavy::text').get()
            url = publishers.css('a.text-blue::attr(href)').get()
            # Skip panels (such as the one near the footer) that lack either field.
            if publisher and url:
                yield {"Publisher": publisher.strip(), "url": url}
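If it helps to check the guard without hitting the site, the "evaluate first, test, then clean" step can be pulled into a small function. This is only a sketch: build_item is a hypothetical helper, not part of Scrapy or the spider above, and the sample values are made up for illustration.

```python
from typing import Optional


def build_item(publisher: Optional[str], url: Optional[str]) -> Optional[dict]:
    """Return a cleaned item, or None when either field is missing."""
    # Guard first: None (or an empty string) means the selector found nothing.
    if not publisher or not url:
        return None
    # Only now is it safe to call string methods on the value.
    return {"Publisher": publisher.strip(), "url": url}


# Made-up values shaped like a panel that contains both fields ...
good = build_item("\n  Example Press\n", "/publisher/example-press")
# ... and like the footer panel, where neither selector matches.
bad = build_item(None, None)

print(good)
print(bad)
```

Inside parse, the two .get() results would be passed to build_item and the item yielded only when the return value is not None.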
Answered By - Alexander