Thursday, March 24, 2022

[FIXED] why scrapy can not find xpath that is found in my browser xpath?


Issue

I'm new to Scrapy. With the code below I can extract the name but not the price. Any idea what I'm doing wrong to get the price? Thank you!

This is the code:

import scrapy


class BfPreciosSpider(scrapy.Spider):
    name = 'BF_precios'
    allowed_domains = ['https://www.boerse-frankfurt.de']
    start_urls = ['https://www.boerse-frankfurt.de/anleihe/xs1186131717-fce-bank-plc-1-134-15-22']

    def parse(self, response):
        what_name = response.xpath('/html/body/app-root/app-wrapper/div/div[2]/app-bond/div[1]/div/app-widget-datasheet-header/div/div/div/div/div[1]/div/h1/text()').extract_first()
        what_price = response.xpath('/html/body/app-root/app-wrapper/div/div[2]/app-bond/div[2]/div[3]/div[1]/font/text()').extract_first()
        yield {'name': what_name, 'price': what_price}

And these are the items I want (marked in red on the page): the name and the price.

Solution

The name is available directly in the page's HTML, but the price is loaded separately from an API, so it is not in the response that Scrapy downloads. If you inspect the network traffic in your browser's developer tools, you will find an API call that returns the price information. The example below shows one way to obtain this data.

import scrapy
from time import time
from urllib.parse import urlencode


class BfPreciosSpider(scrapy.Spider):
    name = 'BF_precios'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['boerse-frankfurt.de']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36'
    }
    start_urls = ['https://www.boerse-frankfurt.de/anleihe/xs1186131717-fce-bank-plc-1-134-15-22']

    def parse(self, response):
        # Name, ISIN and exchange code (MIC) are read from the page HTML
        current_time = int(time())
        name = response.xpath('//h1/text()').get()
        isin = response.xpath("//span[contains(text(),'ISIN:')]/text()").re_first(r"ISIN:\s(.*)$")
        mic = response.xpath("//app-widget-index-price-information/@mic").get()

        # Build the query string with urlencode so that no stray whitespace
        # ends up in the URL; ":" in symbols is encoded as "%3A"
        params = {
            'resolution': 'D',
            'isKeepResolutionForLatestWeeksIfPossible': 'false',
            'from': current_time,
            'to': current_time,
            'isBidAskPrice': 'false',
            'symbols': f'{mic}:{isin}',
        }
        api_url = 'https://api.boerse-frankfurt.de/v1/tradingview/lightweight/history/single?' + urlencode(params)

        item = {'name': name, 'isin': isin, 'mic': mic}
        yield response.follow(api_url, callback=self.parse_price, cb_kwargs={"item": item})

    def parse_price(self, response, item):
        # The price only appears in the JSON returned by the API
        item['price'] = response.json()[0]['quotes']['timeValuePairs'][0]['value']
        yield item

Running the above spider yields a dictionary similar to the one below:

{'name': 'FCE Bank PLC 1,134% 15/22', 'isin': 'XS1186131717', 'mic': 'XFRA', 'price': 99.955}
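
The two steps the spider relies on, building the API URL and reading the price out of the JSON, can be tried without running a crawl. The following is a minimal sketch using only the standard library. The helper names build_history_url and extract_price are hypothetical, and the sample payload is invented for illustration: only the key path [0]['quotes']['timeValuePairs'][0]['value'] comes from the spider above, and the "time" key is an assumption.

```python
from urllib.parse import urlencode

API_BASE = "https://api.boerse-frankfurt.de/v1/tradingview/lightweight/history/single"


def build_history_url(mic, isin, timestamp):
    """Build the history URL with the same query parameters the spider uses."""
    params = {
        "resolution": "D",
        "isKeepResolutionForLatestWeeksIfPossible": "false",
        "from": timestamp,
        "to": timestamp,
        "isBidAskPrice": "false",
        "symbols": f"{mic}:{isin}",  # urlencode turns ":" into "%3A"
    }
    return f"{API_BASE}?{urlencode(params)}"


def extract_price(payload):
    """Return the first quoted value, or None if the expected keys are missing."""
    try:
        return payload[0]["quotes"]["timeValuePairs"][0]["value"]
    except (IndexError, KeyError, TypeError):
        return None


# Hypothetical payload: only the key path mirrors the spider; "time" is assumed.
sample = [{"quotes": {"timeValuePairs": [{"time": 1648080000, "value": 99.955}]}}]

print(build_history_url("XFRA", "XS1186131717", 1648080000))
print(extract_price(sample))  # 99.955
print(extract_price([]))      # None, e.g. when the API returns no quotes
```

Returning None instead of raising lets the spider still yield the name, ISIN and MIC when the API has no quote for the requested timestamp. Whether that is preferable to failing loudly depends on how the items are used downstream.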

Answered By - msenior_

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
