Issue
I'm having trouble with a web crawler I wrote: I want to save the data it fetches. If I understood the Scrapy tutorial correctly, I just need to yield the data and then start the crawler with
scrapy crawl <crawler> -o file.csv -t csv
For some reason the file remains empty. Here's my code:
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PaginebiancheSpider(CrawlSpider):
    name = 'paginebianche'
    allowed_domains = ['paginebianche.it']
    start_urls = ['https://www.paginebianche.it/aziende-clienti/lombardia/milano/comuni.htm']

    rules = (
        Rule(LinkExtractor(allow=(), restrict_css=('.seo-list-name', '.seo-list-name-up')),
             callback="parse_item",
             follow=True),)

    def parse_item(self, response):
        if(response.xpath("//h2[@class='rgs']//strong//text()") != [] and response.xpath("//span[@class='value'][@itemprop='telephone']//text()") != []):
            yield ' '.join(response.xpath("//h2[@class='rgs']//strong//text()").extract()) + " " + response.xpath("//span[@class='value'][@itemprop='telephone']//text()").extract()[0].strip(),
I'm using Python 2.7.
Solution
If you look at the spider's output, you will see a bunch of error messages like this one being logged:
2018-10-20 13:47:52 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'tuple' in <GET https://www.paginebianche.it/lombardia/abbiategrasso/vivai-padovani.html>
This means you're not yielding the right kind of object: you need dicts or Items, not the single-item tuples you're creating.
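The tuple comes from the trailing comma at the end of the yield line in parse_item: in Python, an expression followed by a comma is a one-element tuple. The following is a minimal sketch in plain Python (no Scrapy involved) that illustrates the difference; the function names and the name and phone values are made-up placeholders, not data from the site.

```python
def yields_tuple():
    # The trailing comma turns the string into a 1-element tuple,
    # which is the kind of object the error message complains about.
    yield "Example Name" + " " + "000 0000",


def yields_dict():
    # A dict is one of the types a spider callback may yield.
    yield {"name": "Example Name", "phone": "000 0000"}


# Placeholder values only; nothing here is scraped.
first = next(yields_tuple())
second = next(yields_dict())

print(type(first).__name__)
print(type(second).__name__)
```

Removing the comma alone would yield a plain string instead, which is still not a dict or an Item, so the yielded value has to change shape as well.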
Something as simple as this should work:
yield {
    'name': response.xpath("normalize-space(//h2[@class='rgs'])").get(),
    'phone': response.xpath("//span[@itemprop='telephone']/text()").get()
}
Answered By - stranac