Tuesday, February 22, 2022

[FIXED] Scrapy xpath not extracting div containing special characters <%=


Issue

I am new to Scrapy. I am trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'

I have two problems:

1. My XPath can reach the script element, but it cannot find the h2 or div elements inside the script tag. I even tried saving the HTML file to my machine and scraping that file, but the same problem occurs. I have triple-checked my XPath code, and it all seems in order.
2. When the website is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How would one get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?

My Python code:

import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):

        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}

HTML extract below:

<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
    <div class="branch-agents container_12 first last clearfix">
      <div id="agents-list-left" class="agents-list left grid_6">
      </div>
      <div id="agents-list-right" class="agents-list right grid_6">
      </div>
    </div>
  </div>
</script>

Solution

branch.branch_name looks like a key path into a JSON object. Is there a request that loads the data you are looking for? Possibly, so let's look.
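To illustrate the idea, here is a minimal sketch of how a placeholder such as branch.branch_name could map onto parsed JSON. The payload is made up for illustration (only the branch and branch_name keys are known to exist), and the resolve helper is a hypothetical name, not part of Scrapy or the site's code.

```python
import json
from functools import reduce

# Made-up payload for illustration only; the real response may hold more
# fields and different values.
SAMPLE_BODY = '{"branch": {"id": 25, "branch_name": "Example Branch"}}'


def resolve(data, dotted_path):
    """Follow a dotted path such as 'branch.branch_name' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split('.'), data)


data = json.loads(SAMPLE_BODY)
print(resolve(data, 'branch.branch_name'))
```

The template in the page only names the path; the value itself has to come from wherever the page gets its data.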

In your browser's developer tools, open the Network tab and look through the requests made while the page loads. Among them is this AJAX call, which returns exactly the data you are looking for. So:

import json
import scrapy

class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

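    def start_requests(self):
        # Request the AJAX endpoint found in the browser's network tab,
        # not the HTML page that only contains the template.
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON, so decode the body instead of using XPath.
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}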
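As for the first problem: HTML parsers treat the contents of a script element as raw text, so the div and h2 inside the template never become elements that XPath can select. Below is a minimal sketch of that behaviour. It uses the standard library's html.parser rather than Scrapy so that it runs offline; TagCollector and parse are hypothetical helper names, and the markup is shortened from the question.

```python
from html.parser import HTMLParser

# Shortened copy of the template from the question.
PAGE = '''<script type="text/html" id="id_branch_template">
  <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
    <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
  </div>
</script>'''


class TagCollector(HTMLParser):
    """Record every start tag and every piece of text the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector


# First pass: only the script tag is seen; its body arrives as plain text.
outer = parse(PAGE)
print(outer.tags)

# Second pass: parsing that text on its own exposes the div and h2.
inner = parse(''.join(outer.text))
print(inner.tags)
print(''.join(inner.text).strip())
```

In a spider, the comparable step would be to take the script's text and parse it again, for example with scrapy.Selector(text=...). Even then the h2 holds only the placeholder, not the branch name, which is why the AJAX request above is the more useful route.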

Answered By - Moein Kameli

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
