Issue
I am new to Scrapy. I am trying to extract the h2 text from the following URL: 'https://www.tysonprop.co.za/agents/'
I have 2 problems:
1. My XPath can get to the script element, but it cannot find the h2 or the div elements inside the script tag. I've even tried saving the HTML file to my machine and scraping that file, but the same problem occurs. I have triple-checked my XPath code, and everything seems to be in order.
2. When the website is displayed in my browser, branch.branch_name resolves to "Tysen Properties Head Office". How would one get the value (i.e. "Tysen Properties Head Office") instead of the variable name (branch.branch_name)?
My Python code:
import scrapy


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        url = 'https://www.tysonprop.co.za/agents/'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        script = response.xpath('//script[@id="id_branch_template"]')
        div = script.xpath('./div[contains(@class,"branch-container")]')
        h2 = div.xpath('/h2[contains(@class,"branch-name")]/text()').extract()
        yield {'branchName': h2}
HTML extract below:
<script type="text/html" id="id_branch_template">
    <div id="branch-<%= branch.id %>" class="clearfix margin-top30 branch-container" style="display: none;">
        <h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
        <div class="branch-agents container_12 first last clearfix">
            <div id="agents-list-left" class="agents-list left grid_6">
            </div>
            <div id="agents-list-right" class="agents-list right grid_6">
            </div>
        </div>
    </div>
</script>
Solution
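Does branch.branch_name look like a key path into a JSON object? Is there a call that loads the data you are looking for? Maybe. Let's see.
The contents of a <script> tag are treated as plain text rather than parsed into elements, which is why your XPath stops at the script node. In addition, <%= branch.branch_name %> is a template placeholder that the page fills in with JavaScript after it loads, so the real value is not in the HTML at all. If you open your browser's developer tools and look through the requests in the Network tab, you will find an AJAX call that loads exactly the data you are looking for. So: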
import json

import scrapy


class TysonSpider(scrapy.Spider):
    name = 'tyson_spider'

    def start_requests(self):
        # Request the AJAX endpoint found in the Network tab instead of the HTML page
        url = 'https://www.tysonprop.co.za/ajax/agents/?branch_id=25'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON, so decode it rather than querying it with XPath
        json_data = json.loads(response.text)
        branch_name = json_data['branch']['branch_name']
        yield {'branchName': branch_name}
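As for the first problem in the question (the XPath not reaching the div or h2), here is a minimal sketch of what is going on, using only Python's standard-library HTML parser so that it runs without Scrapy. PAGE is a trimmed copy of the template from the question, and Recorder and parse_html are hypothetical helper names made up for this illustration. The parser reports no div or h2 on the first pass because everything inside the script tag is plain text; feeding that text to a second parser does expose the h2, but its text is still only the placeholder.

```python
from html.parser import HTMLParser

# Trimmed copy of the template from the question (illustration only).
PAGE = """
<html><body>
<script type="text/html" id="id_branch_template">
<div class="clearfix margin-top30 branch-container">
<h2 class="grid_12 branch-name margin-bottom20"><%= branch.branch_name %></h2>
</div>
</script>
</body></html>
"""


class Recorder(HTMLParser):
    """Hypothetical helper: records start tags and the text found inside them."""

    def __init__(self):
        super().__init__()
        self.tags = []      # every start tag the parser reports, in order
        self.chunks = []    # (most recently opened tag, text) pairs
        self._open = None   # tag opened last and not yet followed by an end tag

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self._open = tag

    def handle_endtag(self, tag):
        self._open = None

    def handle_data(self, data):
        self.chunks.append((self._open, data))

    def text_inside(self, tag):
        # Join all text chunks that were reported directly after <tag> opened.
        return "".join(text for opened, text in self.chunks if opened == tag)


def parse_html(html):
    recorder = Recorder()
    recorder.feed(html)
    recorder.close()
    return recorder


# First pass: the div and h2 are never reported as tags,
# because the content of <script> is handed over as raw text.
page = parse_html(PAGE)

# Second pass: parse the script's text as HTML in its own right.
template = parse_html(page.text_inside("script"))
h2_text = template.text_inside("h2").strip()

print(page.tags)
print(template.tags)
print(h2_text)
```

In a Scrapy spider the equivalent would be to extract the script's text with ./text() and pass it to scrapy.Selector(text=...). Either way the h2 only contains the placeholder, which is why the AJAX request above is still needed to get the actual branch name.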
Answered By - Moein Kameli