Issue
I want Scrapy to extract the 'Round Size' in this case, but it turns out that Scrapy can't capture any of the li child nodes under dl.
response.xpath('//*[@id="termsheet"]/div/section[1]/div/dl/li[2]/dt/span').extract()
The XPath expression was generated by Chrome's inspector. When I test the expression separately, it does capture the li tags. I enabled Ajax in Scrapy, and it can capture other dynamic items. Is there any other reason Scrapy would miss this data? Has anyone encountered a similar problem?
Solution
https://www.seedinvest.com/mf.fire/seed/termsheet loads "Round size" with JavaScript, from data fetched from an API endpoint (in this case https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed). Scrapy does not execute JavaScript, so those nodes are missing from the HTML it downloads. You can inspect the network requests in your browser's developer tools, e.g. in Chrome.
The API endpoint returns data as JSON (there's quite a lot of data!), so you can feed it to the standard library json module, as in the example below (using scrapy shell):
$ scrapy shell https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed
2016-06-06 11:36:56 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
(...)
2016-06-06 11:36:58 [scrapy] DEBUG: Crawled (200) <GET https://www.seedinvest.com/api/v1/entities/mf.fire/deal-fundraising-profile/seed> (referer: None)
(...)
>>> import json
>>> d = json.loads(response.text)
>>> d['funding_round']['escrow_max']
1000000.0
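The same lookup can be wrapped in a small helper for use outside the shell. This is only a sketch: the function name extract_round_size and the trimmed-down sample payload are made up for illustration, and it assumes that escrow_max under funding_round is the field shown as "Round size", as in the session above.

```python
import json


def extract_round_size(body_text):
    """Return funding_round.escrow_max from the JSON body, or None if absent."""
    data = json.loads(body_text)
    # "or {}" and .get() avoid errors if the key is missing or null.
    funding_round = data.get("funding_round") or {}
    return funding_round.get("escrow_max")


# Hypothetical, trimmed-down payload; the real response has many more fields.
sample_body = '{"funding_round": {"escrow_max": 1000000.0}}'
round_size = extract_round_size(sample_body)
print(round_size)
```

In a spider, you would request the API URL instead of the HTML page and call extract_round_size(response.text) in the callback.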
Answered By - paul trmbrth