Issue
I'm trying to scrape a page with about 20 articles, but for some reason the spider is only finding the information for the very first article. How do I make it scrape every article on the page?
I've tried changing the XPaths multiple times, but I think I'm too new to this to be sure what the issue is. When I take all the paths out of the for loop it scrapes everything fine, but the output isn't in a format that lets me transfer the data to a CSV file.
import scrapy


class AfgSpider(scrapy.Spider):
    name = 'afg'
    allowed_domains = ['www.pajhwok.com/en']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@id='taxonomy-page-block']")
        for x in container:
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()
            yield {
                'title': title,
                'author': author,
                'rel_url': rel_url,
            }
Solution
Nice answer provided by @Roman. Here are two other options to fix your script:
- Declare the right XPath for your loop step, so the loop iterates over one node per article:
container = response.xpath("//div[@class='node-inner clearfix']")
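Your original loop only ran once because //div[@id='taxonomy-page-block'] matches a single page-level wrapper, whereas the class above matches one node per article. A minimal sketch of the parse method with that selector swapped in (everything else is your original code; the class name comes from the page markup, which may have changed since):

    def parse(self, response):
        # One selector per article, so the loop body runs once per article
        container = response.xpath("//div[@class='node-inner clearfix']")
        for x in container:
            yield {
                'title': x.xpath(".//h2[@class='node-title']/a/text()").get(),
                'author': x.xpath(".//div[@class='field-item even']/a/text()").get(),
                'rel_url': x.xpath(".//h2[@class='node-title']/a/@href").get(),
            }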
- Or, remove your loop step and use the .getall() method to fetch all matches at once:
title = response.xpath(".//h2[@class='node-title']/a/text()").getall()
author = response.xpath(".//div[@class='field-item even']/a/text()").getall()
rel_url = response.xpath(".//h2[@class='node-title']/a/@href").getall()
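Note that .getall() returns three parallel lists, so yielding them as one item puts a whole list into each CSV cell. If you want one CSV row per article, you can pair the lists up with zip(); this pairing step is my addition, not part of the original answer, and it assumes each article contributes exactly one title, author, and link:

    def parse(self, response):
        # .getall() returns parallel lists, one entry per article in page order
        titles = response.xpath("//h2[@class='node-title']/a/text()").getall()
        authors = response.xpath("//div[@class='field-item even']/a/text()").getall()
        rel_urls = response.xpath("//h2[@class='node-title']/a/@href").getall()
        # zip() pairs the lists so each yielded item is one article (one CSV row)
        for title, author, rel_url in zip(titles, authors, rel_urls):
            yield {'title': title, 'author': author, 'rel_url': rel_url}

If any article is missing an author, the lists fall out of alignment, which is why the per-article loop in the first option is the more robust choice.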
Answered By - E.Wiest