Issue
I'm using Scrapy with Python to scrape a page. My goals are to:
1. Get the href value from the a tag and prepend https://careers.infinity.aero/ to the href value
2. Export this list to a csv file
3. Run a 2nd script to pull those URLs for another scrape
I'm stuck trying to get concat() to work in XPath - I believe it's a syntax problem or the placement of the href, but I've not had much luck finding anything to help me.
Here is what I've got:
import scrapy
from scrapy.crawler import CrawlerProcess

class dgtest2(scrapy.Spider):
    name = "dgtest2"
    start_urls = [
        'https://careers.infinity.aero/Careers.aspx'
    ]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'urls.csv'
    }

    def parse(self, response):
        url = response.xpath('concat( string("https://careers.infinity.aero/"), //a/@href)').getall()
        yield {
            'URL': url,
        }

process = CrawlerProcess()
process.crawl(dgtest2)
process.start()
I've had success importing from a csv file in my 2nd script, and I've had success pulling the href using:

url = response.xpath('//a/@href').getall()

and exporting it to a csv file, but the href values are only partial URLs, which is why I need to prepend the domain.
Any info would be appreciated. Thanks in advance!
Solution
To concatenate two parts of a URL into one, you can either use the standard library urljoin function or the Scrapy-provided convenience method response.urljoin.
USING RESPONSE OBJECT
def parse(self, response):
    for url in response.xpath("//a/@href").getall():
        yield {
            'URL': response.urljoin(url),
        }
USING STANDARD LIBRARY (take note of the import)
def parse(self, response):
    from urllib.parse import urljoin
    for url in response.xpath("//a/@href").getall():
        yield {
            'URL': urljoin("https://careers.infinity.aero/", url),
        }
Answered By - msenior_