Issue
I'm using Scrapy with Python to scrape a page. My goals are to:
1. Get the href value from the a tag and prepend https://careers.infinity.aero/ to the href value
2. Export this list to a csv file
3. Run a 2nd script to pull those URLs for another scrape
I'm stuck trying to get concat() to work in XPath - I believe it's a syntax problem or the placement of the href, but I've not had much luck finding anything to help me.
Here is what I've got:
import scrapy
from scrapy.crawler import CrawlerProcess

class dgtest2(scrapy.Spider):
    name = "dgtest2"
    start_urls = [
        'https://careers.infinity.aero/Careers.aspx'
    ]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'urls.csv'
    }

    def parse(self, response):
        url = response.xpath('concat( string("https://careers.infinity.aero/"), //a/@href)').getall()
        yield {
            'URL': url,
        }

process = CrawlerProcess()
process.crawl(dgtest2)
process.start()
I've had success importing from a csv file in my 2nd script, and I've had success pulling the href using:

url = response.xpath('//a/@href').getall()

and exporting it to a csv file, but the href values are only partial URLs, which is why I need to prepend the domain.
Any info would be appreciated. Thanks in advance!
Solution
To concatenate two parts of a URL into one, you can either use the standard library urljoin function or the Scrapy-provided convenience method response.urljoin.
USING RESPONSE OBJECT
def parse(self, response):
    for url in response.xpath("//a/@href").getall():
        yield {
            'URL': response.urljoin(url),
        }
USING STANDARD LIBRARY (take note of the import)
def parse(self, response):
    from urllib.parse import urljoin
    for url in response.xpath("//a/@href").getall():
        yield {
            'URL': urljoin("https://careers.infinity.aero/", url),
        }
Answered By - msenior_