Sunday, January 23, 2022

XPath

Issue

I wrote the following spider, but running it produces two problems:

The headline is cut off, probably because of the <em> tag inside it.
The location contains surrounding whitespace and \n characters.

I have not yet found a solution for either issue.

import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def parse(self, response):
        for github in response.css(".Box-row"):
            yield {
                "github_link": github.css(".mr-1::attr(href)").get(),
                "name": github.css(".mr-1::text").get(),
                "headline": github.css(".mb-1::text").get(),
                "location": github.css(".mr-3:nth-child(1)::text").get(),
            }

Expected result

# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
    'github_link': '/djangofan',
    'name': 'Jon Austen',
    'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, React-Native, and Docker. Focus: Testing, CI, and Micro-Services.',
    'location': 'Portland, OR'
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
    'github_link': '/django-wong',
    'name': ' Wong',
    'headline': 'PHP / Node.js / Dart (Flutter) / React Native / Scala',
    'location': 'China'
}
[...]

Actual Result

# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
    'github_link': '/djangofan',
    'name': 'Jon Austen',
    'headline': 'Software Engineer interested in Java, Python, Ruby, Groovy, Bash, Clojure, ',
    'location': '\n          Portland, OR\n        '
}
# 2021-08-07 11:59:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://github.com/search?p=1&q=React+Django&type=Users>
{
    'github_link': '/django-wong',
    'name': ' Wong',
    'headline': 'PHP / Node.js / Dart (Flutter) / ',
    'location': '\n          China\n        '
}
[...]

Solution

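The first problem occurs because ::text selects only the text nodes that are direct children of the element, and get() returns the first of them, which ends where the <em> tag begins. It can be fixed with XPath and string(), which returns the element's full string value, including the text inside nested tags.

The second problem can be fixed by calling strip() on each extracted value.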
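
The difference between the first text node and the full string value can be sketched without Scrapy. The snippet below is only an analogy using the standard library's ElementTree, and the markup is a made-up stand-in for one result row, not GitHub's real HTML: p.text plays the role of ::text with get(), and joining itertext() plays the role of string().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mimicking one search result row (an assumption for illustration).
ROW = '<div class="Box-row"><p class="mb-1">Likes <em>React</em> and Django</p></div>'

row = ET.fromstring(ROW)
p = row.find("./p[@class='mb-1']")

# Text before the first child element only: roughly what ::text + get() yields.
first_text_node = p.text

# All descendant text joined in document order: roughly what XPath string() yields.
string_value = "".join(p.itertext())

print(repr(first_text_node))  # 'Likes '
print(repr(string_value))     # 'Likes React and Django'
```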

import scrapy


class GitHubSpider(scrapy.Spider):
    name = "github"
    start_urls = [
        "https://github.com/search?p=1&q=React+Django&type=Users",
    ]

    def strip_string(self, string):
        # Strip surrounding whitespace; return None when nothing was matched.
        if string is not None:
            return string.strip()
        return None

    def parse(self, response):
        for github in response.css(".Box-row"):
            github_link = self.strip_string(github.css(".mr-1::attr(href)").get())
            name = self.strip_string(github.css(".mr-1::text").get())
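            # The leading "." keeps the XPath relative to the current row; a bare
            # "//p" would search the whole page and return the same headline each time.
            headline = self.strip_string(github.xpath('string(.//p[@class="mb-1"])').get())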
            location = self.strip_string(github.css(".mr-3:nth-child(1)::text").get())
            yield {
                "github_link": github_link,
                "name": name,
                "headline": headline,
                "location": location
            }
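
As a possible variation on strip_string, the helper below also collapses whitespace inside the value, which may help if string() returns a headline with line breaks in the middle. It is a sketch rather than part of the original answer, and the name clean_text is hypothetical.

```python
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Collapse runs of whitespace into single spaces; pass None through."""
    if value is None:
        return None
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(value.split())


print(repr(clean_text("\n          Portland, OR\n        ")))  # 'Portland, OR'
print(clean_text(None))  # None
```

It could replace strip_string in the spider above without other changes, since both accept the result of get() and tolerate None.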

Answered By - SuperUser

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
