Issue
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Jptimes3Spider(CrawlSpider):
    name = 'jptimes3'
    allowed_domains = ['japantimes.co.jp']
    start_urls = ['https://www.japantimes.co.jp/']

    custom_settings = {
        'DOWNLOAD_DELAY': 3,
    }

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[@id="page"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'category': response.css('h3 > span.category-column::text').getall(),
            'category2': response.css('h3.category-column::text').getall(),
            'article title': response.css('p.article-title::text').getall(),
            'summary': response.xpath('//*[@id="wrapper"]/section[2]/div[1]/section[4]/div/ul/li[4]/a/article/header/hgroup/p/text()').getall()
        }
I'm new to Scrapy and this is my first crawl spider. I'm having two issues. The first is that it will follow the links but not scrape any items; I just get the column headers in my CSV. Also, I was wondering if there is a way to grab the same kind of data (categories, for instance) into the same column when it sits under different CSS/XPath selectors?
Solution
The XPath selection in the rules and in the parse_item method was incorrect. Here is an example of a working solution.
script:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Jptimes3Spider(CrawlSpider):
    name = 'jptimes3'
    allowed_domains = ['japantimes.co.jp']
    start_urls = ['https://www.japantimes.co.jp']

    custom_settings = {
        'DOWNLOAD_DELAY': 3,
    }

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@data-tb-region="Top News"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'category': response.xpath('//h3[@class="single-post-categories"]/a/text()').get(),
            'article title': ''.join(response.xpath('//h1/text()').getall())
        }
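As for the second part of the question (getting the same kind of data into one column even when it comes from different CSS/XPath selectors): you can combine the results of several selectors into a single field before yielding the item. Below is a minimal sketch that reuses the two category selectors from the question; whether those selectors still match the site's current markup is an assumption, not something verified here.

def parse_item(self, response):
    # Both selectors are taken from the question (assumed, not verified);
    # getall() returns a list, so concatenating the lists merges every match
    # from both sources into one value, i.e. one CSV column.
    categories = (
        response.css('h3 > span.category-column::text').getall()
        + response.css('h3.category-column::text').getall()
    )
    yield {
        'category': categories,
        'article title': ''.join(response.xpath('//h1/text()').getall()),
    }

The same effect can be had with a single XPath union expression (selector_a | selector_b), since getall() simply returns every node matched by the combined expression.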
Answered By - F.Hoque