Issue
UPDATE: the number 48 is shown in Chrome's "Inspect" view, but not in "View Page Source". I now understand that it is generated by JavaScript, which is why I cannot extract it.
This is the part of the HTML that I am trying to scrape:
<span class="value">
<span class="base-entity-display-count">48</span>
"times"
</span>
The problem is that I cannot get the number 48.
I think the problem is that there are no quotes ("") around 48.
I can get the "times" text with no problems, and the only difference I can see is that there are no quotes around 48.
This is the code that works for "times":
response.xpath('.//span[@class="value"]/text()').extract_first()
>>> u'<span class="value"><span class="base-entity-display-count"></span>times</span>'
For 48:
response.xpath('.//span[@class="base-entity-display-count"]').extract_first()
>>> u'<span class="base-entity-display-count"></span>'
As you can see, 48 is missing.
Does anybody have a solution or an idea?
Solution
If you look at the body of the page and search for your number, you can see that there is some embedded JSON.
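One way to do that search from code rather than by eye is sketched below. The helper name `find_with_context` and the sample body are hypothetical and only illustrate the idea; in a spider you would pass the real response body instead.

```python
def find_with_context(body, needle, width=30):
    """Return every occurrence of needle with up to `width` characters on each side."""
    hits = []
    start = body.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(body), start + len(needle) + width)
        hits.append(body[lo:hi])
        start = body.find(needle, start + 1)
    return hits


# Hypothetical raw body: the span is empty, but the value sits in a script blob.
sample = (
    '<span class="base-entity-display-count"></span>'
    '<script>app.boot.push({"displayCountText": "48"});</script>'
)

for hit in find_with_context(sample, "48"):
    print(hit)
```

If the only hits are inside a script block, as in this sample, the value is not in the markup itself and has to be read from the embedded data.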
To solve this you can:
find the embedded JSON with a regex:
import re
# select everything between "app.boot.push(" and ");"
# (on newer Scrapy versions, response.text replaces response.body_as_unicode())
data = re.findall(r'app\.boot\.push\((\{.+?\})\);', response.body_as_unicode())
load the JSON and parse it with Python to find the values you want:
import json
data = [json.loads(d) for d in data]
for d in data:
    if d.get('name') == 'BaseEntityDetails':
        print(d['values']['displayCountText'])
# prints: 66
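The two steps above can be tried without a live response. The following is a minimal sketch that runs the same regex and JSON lookup against a made-up page body; `SAMPLE_BODY` and the helper name `extract_display_count` are assumptions for illustration, and the real page's blobs may contain more fields or a different layout.

```python
import json
import re

# Hypothetical page body, standing in for the response text in a Scrapy callback.
SAMPLE_BODY = """
<span class="value"><span class="base-entity-display-count"></span>times</span>
<script>
app.boot.push({"name": "Header", "values": {"title": "Example"}});
app.boot.push({"name": "BaseEntityDetails", "values": {"displayCountText": "48"}});
</script>
"""

# Same pattern as in the answer: capture the object passed to app.boot.push(...).
BOOT_PUSH_RE = re.compile(r'app\.boot\.push\((\{.+?\})\);')


def extract_display_count(body):
    """Return displayCountText from the BaseEntityDetails blob, or None if absent."""
    for raw in BOOT_PUSH_RE.findall(body):
        try:
            blob = json.loads(raw)
        except ValueError:
            # Skip anything the regex captured that is not valid JSON.
            continue
        if blob.get("name") == "BaseEntityDetails":
            return blob.get("values", {}).get("displayCountText")
    return None


print(extract_display_count(SAMPLE_BODY))
```

Returning None instead of raising makes it easier to notice when the site changes its embedded data, since the spider can log the miss and carry on.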
Answered By - Granitosaurus