Issue
I'm new to Scrapy and I'm having trouble forming an accurate selector, starting from Scrapy's tutorial code. Basically, I'm trying to extract all offices listed in a state's directory. To determine which office belongs to which branch of government, I think I need what's inside the h6 tag as well as the ul/li elements descending from each one:
This code works fine, and I can save the output for each office to a JSON file for later processing. However, it doesn't include the branch above each office, just an empty space.
import scrapy


class NewSpider(scrapy.Spider):
    name = 'Wyoming'
    start_urls = [
        'http://www.wyo.gov/agencies'
    ]

    def parse(self, response):
        # Every <li> directly under any <ul> on the page
        for sel in response.xpath('//ul/li'):
            yield {
                "Text": sel.xpath('a/text()').get(),
                "Link": sel.xpath('a/@href').get(),
            }
But (and this is where my inexperience shows) when I adjust it to capture the list header:
import scrapy


class NewSpider(scrapy.Spider):
    name = 'Wyoming'
    start_urls = [
        'http://www.wyo.gov/agencies'
    ]

    def parse(self, response):
        for sel in response.xpath('//h6/ul/li'):
            yield {
                "Hierarchy": sel.xpath('a/name').get(),
                "Text": sel.xpath('a/text()').get(),
                "Link": sel.xpath('a/@href').get(),
            }
I'm currently using this cheat sheet and reading up on XPath in general, since I've read that it's very powerful, but I'm still confused about how to format the syntax. Please let me know if there is anything else I can provide!
Solution
The issue is that h6 is not a parent element of ul, but its sibling. So the best approach, in my opinion, would be:
def parse(self, response):
    # Select every <ul> that has an <h6> before it among its siblings
    for unordered_list in response.xpath('//ul[preceding-sibling::h6]'):
        # On a reverse axis, [1] picks the nearest preceding <h6>
        list_header = unordered_list.xpath('preceding-sibling::h6[1]//font/text()').get()
        rows = unordered_list.xpath('li')
        for sel in rows:
            yield {
                "Hierarchy": list_header,
                "Text": sel.xpath('a/text()').get(),
                "Link": sel.xpath('a/@href').get(),
            }
Edited:
My previous XPath was selecting all ul elements for each header. Due to some inconsistencies in the page's HTML, I changed the selectors to first select each ul and then find the preceding h6 tag that contains its header. This should work correctly now.
Answered By - renatodvc