Issue
I am trying to pull all the TD values from the "table-main" table on the website: http://www.oddsportal.com/basketball/usa/nba/results/
I am using Scrapy and Python 2.7.
From Scrapy Shell I can get the table via:
response.xpath('//*[@id="tournamentTable"]')
But I cannot get any of the TR or TD elements of that table. Both
response.xpath('//*[@id="tournamentTable"]/tbody')
and
response.xpath('//*[@id="tournamentTable"]/tbody/tr')
return an empty list. I suspect the table is created dynamically. How can I scrape all the team names, scores, and odds from that table?
Note on possible duplicate
This question is different from the one suggested as a duplicate, "Scrapy not finding table", because that question is about getting the table, whereas this one is about getting the data in the table.
Solution
Yes, the results are loaded with an additional call to the website API. In this case the request is made to http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826.
I'm not sure you can hardcode the URL in your spider since, at the very least, the 3 and MmbLsWh8 parts of the URL come from a script tag on the main page:
<script type="text/javascript">
//<![CDATA[
var op = new OpHandler();if(!page)var page = new PageTournament({"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;vJs();op.init();if(page && page.display)page.display(); var sigEndPage = true;
try
{
if (sigEndJs)
{
globals.onPageReady();
}
} catch (e)
{
}
//]]>
</script>
Plus, there is a _ parameter that looks like a timestamp.
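As a rough sketch of how the URL could be rebuilt instead of hardcoded, the snippet below pulls the PageTournament(...) arguments out of the page source and fills them into the URL. The function name build_archive_url is made up for illustration. Keeping the X0/1/-1/1 segments fixed and sending the _ value as milliseconds since the epoch are assumptions based only on the single URL observed above.

```python
import json
import re
import time

# Inline script fragment copied from the page source shown above.
PAGE_SCRIPT = (
    'var op = new OpHandler();if(!page)var page = new PageTournament('
    '{"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;'
)


def build_archive_url(page_source, timestamp_ms=None):
    """Build the AJAX archive URL from the PageTournament(...) arguments.

    Hypothetical helper: the URL layout is inferred from one observed request.
    """
    match = re.search(r'new PageTournament\((\{.*?\})\)', page_source)
    if match is None:
        raise ValueError("PageTournament call not found in page source")
    # The argument is a plain JSON object, so json.loads can decode it.
    params = json.loads(match.group(1))
    if timestamp_ms is None:
        # Assumption: "_" is a cache-busting timestamp in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    return (
        "http://fb.oddsportal.com/ajax-sport-country-tournament-archive/"
        "{sid}/{id}/X0/1/-1/1/?_={ts}"
    ).format(sid=params["sid"], id=params["id"], ts=timestamp_ms)


url = build_archive_url(PAGE_SCRIPT, timestamp_ms=1446338252826)
```

In a spider, the page source would come from the response for the results page, and the returned URL would be the target of a second request.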
The call to this AJAX URL returns a JSONP response with the HTML of the NBA results inside. You need to extract the HTML from the response (with a regular expression, for instance), feed it to a Selector, and extract the results. Some sample code from the shell to get you started:
$ scrapy shell http://www.oddsportal.com/basketball/usa/nba/results/
In [1]: fetch("http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826")
In [2]: import re
In [3]: pattern = re.compile(r'"html":"(.*?)"}', re.MULTILINE | re.DOTALL)
In [4]: import scrapy
In [5]: selector = scrapy.Selector(text=pattern.search(response.body).group(1))
In [6]: # TODO: now use the selector to extract the desired data
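One caveat with the regular expression above is that the captured text still carries JSON escapes such as \" and \/. An alternative sketch, not part of the original session, is to strip the JSONP wrapper and decode the payload with the json module. The sample payload below is made up; the real callback name and the nesting around the "html" key are assumptions, which is why the helper searches for the key recursively.

```python
import json

# Made-up payload in the general shape of a JSONP response; the real
# callback name and the structure around "html" are assumptions.
SAMPLE_JSONP = r'callback({"d": {"html": "<table><tr><td class=\"name\">A - B<\/td><\/tr><\/table>"}});'


def jsonp_to_json(text):
    """Strip the callback(...) wrapper and decode the JSON inside it."""
    start = text.index("(") + 1   # first "(" opens the callback call
    end = text.rindex(")")        # last ")" closes it
    return json.loads(text[start:end])


def find_key(obj, key):
    """Return the first value stored under `key` in nested dicts/lists."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


html = find_key(jsonp_to_json(SAMPLE_JSONP), "html")
```

The decoded html string can then be passed to scrapy.Selector(text=html) in the same way as in the shell session. On Python 3, response.body is bytes, so it would need decoding before being handed to jsonp_to_json.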
Answered By - alecxe