Issue
I'm mostly a lurker on this platform and try to solve my problems using the answers to already-asked questions, but I couldn't find one that matches my current problem. I'm trying to scrape data from this website using Scrapy. I can already scrape most of the data I need; however, there are two interactive Highcharts charts whose data I'd like to have. (Picture of first graph)
What I tried so far:
- Extracting the data directly from the HTML response, but I can only access the axis values, so this approach did not work.
- Extracting the data by finding the API call with the dev tools in the browser, similar to this approach. However, the only XHR visible is called footprint and does not contain any response. The Initiator tab of the footprint request shows a request call stack pointing to https://crowdcircus.com/js/app.js?id=6677107ebf6c7824be09, but I don't know if this helps since I'm really new to JSON and web scraping.
A hint and/or an explanation of how to scrape the chart data from this website would be much appreciated.
To see the graphs you have to log in here.
I've created a throwaway account (email: [email protected], password: 12345) so you can see the data.
Update:
Sebastian's answer pointed me in the right direction.
I ended up using scrapy-splash, which allows you to execute JavaScript code on the page from a Lua script. With the code below I'm able to scrape all the data I needed.
LUA_SCRIPT = """
function main(splash)
    -- Get cookies from previous session
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    assert(splash:wait(0.5))

    -- Extract data from page
    -- Read the number of series in the second chart
    local table_2_no_series = splash:evaljs('Highcharts.charts[1].series.length')
    local table_2_y1_data, table_2_y1_name, table_2_y3_data, table_2_y3_name

    -- If the second chart has more than one series, get that data as well
    if (table_2_no_series == 2) or (table_2_no_series == 3) then
        table_2_y1_data = splash:evaljs('Highcharts.charts[1].series[0].yData')
        table_2_y1_name = splash:evaljs('Highcharts.charts[1].series[0].name')
    end
    if (table_2_no_series == 3) then
        table_2_y3_data = splash:evaljs('Highcharts.charts[1].series[2].yData')
        table_2_y3_name = splash:evaljs('Highcharts.charts[1].series[2].name')
    end

    return {
        -- Extract website title
        title = splash:evaljs('document.title'),
        -- Extract first chart data
        table_1_name = splash:evaljs('Highcharts.charts[0].title.textStr'),
        -- Extract timestamps
        table_1_x = splash:evaljs('Highcharts.charts[0].series[0].xAxis.categories'),
        -- Extract Finanzierungsstand
        table_1_y_data = splash:evaljs('Highcharts.charts[0].series[1].yData'),
        table_1_y_name = splash:evaljs('Highcharts.charts[0].title.textStr'),
        -- Extract second chart data; the fields are keyed by name so they
        -- do not end up as positional entries (nil values are omitted)
        table_2_y1_data = table_2_y1_data,
        table_2_y1_name = table_2_y1_name,
        table_2_y3_data = table_2_y3_data,
        table_2_y3_name = table_2_y3_name,
        cookies = splash:get_cookies(),
    }
end
"""
# Runs inside a spider callback (it uses self and response).
# Requires: from scrapy_splash import SplashRequest
SCRAPY_ARGS = {
    'lua_source': LUA_SCRIPT,
    'cookies': self.cookies,
}
# Look for JSON data if we successfully logged in
yield SplashRequest(url=response.url,
                    callback=self.parse_highchart_data,
                    endpoint='execute', args=SCRAPY_ARGS,
                    session_id="foo")
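The callback parse_highchart_data named in the request is not shown above. As a rough sketch of the data handling it might do, assuming the Lua script's return table arrives as a dict (with scrapy-splash and the execute endpoint this is normally exposed as response.data) and that the chart arrays arrive as plain lists. The helper name pair_chart_points and the sample values are hypothetical:

```python
def pair_chart_points(data):
    """Pair each x-axis category with the matching y value of the first chart."""
    categories = data.get("table_1_x") or []
    values = data.get("table_1_y_data") or []
    name = data.get("table_1_y_name", "value")
    # zip() stops at the shorter list, so a length mismatch drops the surplus
    return [{"timestamp": x, name: y} for x, y in zip(categories, values)]


# Hypothetical sample shaped like the Lua script's return table
sample = {
    "table_1_x": ["01.01.", "02.01.", "03.01."],
    "table_1_y_data": [1000, 2500, 4000],
    "table_1_y_name": "Finanzierungsstand",
}
rows = pair_chart_points(sample)
```

Inside the spider, the callback could then yield each of these rows as an item.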
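Note: The Highcharts API also has a .getCSV method, which exports the chart data in CSV format. However, it was not available on this site. It is provided by the Highcharts export-data module, so the site may simply not load that module, or it may have blocked the function.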
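Since .getCSV was not usable here, the scraped arrays can be turned into CSV on the Python side instead. A minimal sketch using only the standard library; the function name series_to_csv, the column headers, and the sample values are made up for illustration:

```python
import csv
import io


def series_to_csv(categories, values, x_header="timestamp", y_header="value"):
    """Return CSV text with one row per (category, value) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([x_header, y_header])
    writer.writerows(zip(categories, values))
    return buffer.getvalue()


csv_text = series_to_csv(
    ["01.01.", "02.01."], [1000, 2500], y_header="Finanzierungsstand"
)
```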
Solution
It's not exactly a scraping/fetching approach, but on a page that uses Highcharts you can inspect the whole chart configuration from the browser's web console. Try:
console.log(Highcharts.charts)
which shows the array of charts rendered on the page. Next, go to a particular chart -> series -> data, for example:
console.log(Highcharts.charts[0].series[1].data)
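To get from console inspection to something a scraper can consume, one option is to have the page serialize the series itself and parse the result in Python. The sketch below is an assumption, not part of the answer: it builds a JavaScript expression that could be run in the console or passed to a browser-automation tool, and it reads xData/yData (the plain arrays used in the question's update) rather than .data, whose point objects are not suited to JSON.stringify. The filter(Boolean) step skips slots of destroyed charts, which Highcharts leaves as undefined. The names JS_EXTRACT and parse_charts are hypothetical:

```python
import json

# JavaScript expression that reduces every rendered chart to plain
# name/x/y arrays and serializes them as JSON (illustrative assumption)
JS_EXTRACT = (
    "JSON.stringify(Highcharts.charts.filter(Boolean).map(function (chart) {"
    " return chart.series.map(function (s) {"
    " return {name: s.name, x: s.xData, y: s.yData}; }); }))"
)


def parse_charts(json_text):
    """Turn the JSON string from JS_EXTRACT into one {series name: [(x, y), ...]} dict per chart."""
    charts = json.loads(json_text)
    return [
        {s["name"]: list(zip(s["x"], s["y"])) for s in chart}
        for chart in charts
    ]


# Hypothetical output of JS_EXTRACT for a page with one chart and one series
sample = '[[{"name": "Finanzierungsstand", "x": [0, 1], "y": [1000, 2500]}]]'
parsed = parse_charts(sample)
```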
Answered By - Sebastian Wędzel