Issue
I'm trying to extract the latitude and longitude from this page: https://www.realestate.com.kh/buy/nirouth/4-bed-5-bath-twin-villa-143957/
They can be found in this part of the page (its XPath is /html/head/script[8]):
<script type="application/ld+json">{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"},"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh","addressLocality":"Chbar Ampov"}}}</script>
Here's my script:
import scrapy

class ScrapingSpider(scrapy.Spider):
    name = 'scraping'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        lat = response.xpath('/html/head/script[8]')
        print('----------------', lat)
        yield {
            'lat': lat
        }
However, this XPath yields an empty list. Is it because the content I'm looking for is in a JS script?
Solution
Scrapy doesn't execute JavaScript, so some <script> tags that you see in the browser may not be present in the response Scrapy receives. That can shift the position of the remaining tags, so using an index such as script[8] to pinpoint the element isn't a good idea. It's better to match on something specific to the element. My suggestion would be:
response.xpath('//head/script[contains(text(), "latitude")]')
Edit:
The selector above returns a SelectorList, and you can choose how to parse it from there. If you want to extract the whole text of the script, you can use:
response.xpath('//head/script[contains(text(), "latitude")]/text()').get()
If you want only the latitude value, you can use a regex:
response.xpath('//head/script[contains(text(), "latitude")]/text()').re_first(r'"latitude":(\d{1,3}\.\d{1,2})')
Docs on using regex methods of Selectors.
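Since the script content is JSON-LD, another option is to parse it as JSON rather than matching it with a regex. The following is only a minimal sketch, not part of the original answer: extract_geo is a hypothetical helper name, and the sample string is the payload quoted in the question. It assumes the script text is a single JSON object with a "geo" key.

```python
import json


def extract_geo(ld_json_text):
    """Return (latitude, longitude) from a JSON-LD string, or None if absent.

    Hypothetical helper: assumes the text is one JSON object with a "geo" key.
    """
    try:
        data = json.loads(ld_json_text)
    except (TypeError, ValueError):
        # Covers None (no matching script found) and malformed JSON.
        return None
    geo = data.get("geo") if isinstance(data, dict) else None
    if not isinstance(geo, dict):
        return None
    lat, lon = geo.get("latitude"), geo.get("longitude")
    if lat is None or lon is None:
        return None
    return float(lat), float(lon)


# Sample payload copied from the question.
sample = (
    '{"@context":"http://schema.org","@type":"Residence",'
    '"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh",'
    '"addressLocality":"Chbar Ampov"},'
    '"geo":{"@type":"GeoCoordinates","latitude":11.52,"longitude":104.95,'
    '"address":{"@type":"PostalAddress","addressRegion":"Phnom Penh",'
    '"addressLocality":"Chbar Ampov"}}}'
)

coords = extract_geo(sample)
print(coords)
```

Inside a spider, the text returned by the .get() call shown earlier could be passed straight to this helper. Because .get() returns None when nothing matches, the helper returns None in that case instead of raising.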
Answered By - renatodvc