Sunday, January 30, 2022

[FIXED] Scrapy Xpath extract h6 and ul/li selector in same loop


Issue

I'm new to Scrapy and I'm having trouble forming an accurate selector, starting from Scrapy's tutorial code. I'm trying to extract all offices listed in a state's directory. To determine which branch of government each office belongs to, I need (I think) the text inside the h6 tag as well as the ul/li elements descending from each one:

This code works fine, and I can save each office to a JSON file for later processing. However, the output doesn't include the branch heading above each office, just an empty space.

import scrapy


class NewSpider(scrapy.Spider):
    name = 'Wyoming'
    start_urls = [
        'http://www.wyo.gov/agencies'
    ]

    def parse(self, response):
        # Every list item on the page, regardless of which section it sits in.
        for sel in response.xpath('//ul/li'):
            yield {
                "Text": sel.xpath('a/text()').get(),
                "Link": sel.xpath('a/@href').get(),
            }

But (and this is where my inexperience shows) the adjusted version below, which tries to capture the list header, does not give me what I need:

class NewSpider(scrapy.Spider):
    name = 'Wyoming'
    start_urls = [
            'http://www.wyo.gov/agencies'
    ]

    def parse(self, response):

        for sel in response.xpath('//h6/ul/li'):
            yield {
                    "Hierarchy": sel.xpath('a/name').get(),
                    "Text"     : sel.xpath('a/text()').get(),
                    "Link"     : sel.xpath('a/@href').get(),
            }

I'm currently using an XPath cheat sheet and reading up on XPath in general, since I've read that it's very powerful, but I'm still confused about how to write the syntax. Please let me know if there is anything else I can provide!

Solution

The issue is that h6 is not a parent element of ul, but its sibling, so a path like //h6/ul/li matches nothing. The best approach, in my opinion, would be:

def parse(self, response):
    # Only consider lists that have an h6 header somewhere before them.
    for unordered_list in response.xpath('//ul[preceding-sibling::h6]'):
        # On a reverse axis, [1] picks the nearest preceding h6.
        list_header = unordered_list.xpath('preceding-sibling::h6[1]//font/text()').get()
        rows = unordered_list.xpath('li')
        for sel in rows:
            yield {
                "Hierarchy": list_header,
                "Text": sel.xpath('a/text()').get(),
                "Link": sel.xpath('a/@href').get(),
            }
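To see the sibling relationship without running a crawl, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of Scrapy. The markup in SAMPLE is a hypothetical, simplified stand-in for the page (the real HTML is not shown here), and group_offices is an illustrative helper, not part of Scrapy. ElementTree does not support the preceding-sibling axis, so the sketch walks the children in document order and remembers the latest h6 instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each h6 is followed by a sibling ul.
SAMPLE = """
<div>
  <h6><font>Executive Branch</font></h6>
  <ul>
    <li><a href="/governor">Governor</a></li>
    <li><a href="/auditor">State Auditor</a></li>
  </ul>
  <h6><font>Judicial Branch</font></h6>
  <ul>
    <li><a href="/courts">Supreme Court</a></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# h6 has no ul child, so a parent/child path finds nothing.
nested = root.findall(".//h6/ul/li")


def group_offices(container):
    """Pair every list item with the most recent h6 seen before its ul."""
    rows = []
    header = None
    for child in container:
        if child.tag == "h6":
            header = "".join(child.itertext()).strip()
        elif child.tag == "ul":
            for link in child.findall("li/a"):
                rows.append({
                    "Hierarchy": header,
                    "Text": link.text,
                    "Link": link.get("href"),
                })
    return rows


offices = group_offices(root)
```

On this sample, nested is empty while offices holds one dictionary per link, each carrying the header of its own section.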

Edited: My previous XPath was selecting all ul elements for each header. Due to some inconsistencies in the page's HTML, I changed the selectors to first select each ul and then find the nearest preceding h6 tag, which contains its header. This should work correctly now.
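The difference between the two directions can be sketched in plain Python. The earlier XPath is not shown, so the header-first function below is only an assumption about how "all ul for each header" can happen (it mimics following-sibling::ul from each h6). The sample markup and both function names are hypothetical, and the standard library's xml.etree.ElementTree stands in for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two headers, each followed by its own list.
SAMPLE = """<div>
  <h6>Executive Branch</h6>
  <ul><li>Governor</li><li>State Auditor</li></ul>
  <h6>Judicial Branch</h6>
  <ul><li>Supreme Court</li></ul>
</div>"""

children = list(ET.fromstring(SAMPLE))


def lists_after_each_header(nodes):
    """Header-first: item counts of EVERY later ul, like following-sibling::ul."""
    return {
        node.text: [len(later.findall("li")) for later in nodes[i + 1:] if later.tag == "ul"]
        for i, node in enumerate(nodes)
        if node.tag == "h6"
    }


def nearest_header_for_each_list(nodes):
    """List-first: closest earlier h6 only, like preceding-sibling::h6[1]."""
    pairs = []
    for i, node in enumerate(nodes):
        if node.tag != "ul":
            continue
        header = next(
            (prev.text for prev in reversed(nodes[:i]) if prev.tag == "h6"),
            None,
        )
        if header is not None:  # mirrors the [preceding-sibling::h6] filter
            pairs.append((header, [li.text for li in node.findall("li")]))
    return pairs


header_first = lists_after_each_header(children)
list_first = nearest_header_for_each_list(children)
```

Starting from the header, the first section also picks up the second section's list. Starting from the list, each ul is paired with exactly one header.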

Answered By - renatodvc

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
