Issue
I tried to scrape Kickstarter. However, I do not get any results when I try to get the URLs that refer to the projects.
This should be one of the results:
and this is my code:
Code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>%                          # feed `main.page1` to the next step
  html_nodes(".block.img-placeholder.w100p") %>% # get the CSS nodes
  html_attr("href")                              # extract the URLs
Does anyone see where I went wrong?
Solution
First declare all the packages you use - I had to go searching to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As that stands, the data for each project is stashed in a data-project attribute in a bunch of divs. In the browser, some Javascript (I suspect built using the React framework) would normally fill in the other divs, fetch the images, format the links and so on. But you have just grabbed the raw HTML, so none of that is available. The raw data is, though... So...
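You can confirm that the selector from the question finds nothing in the raw HTML (a quick check of my own, not shown in the original answer; the question itself reports getting no results):
> main.page1 %>% html_nodes(".block.img-placeholder.w100p") %>% length()  # 0 - those nodes are built later by the Javascript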
The relevant divs appear to have the class "react-disc-landing", so this gets the data as text strings:
> data = main.page1 %>%
    html_nodes("div.react-disc-landing") %>%
    html_attr("data-project")
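A quick sanity check that this selector did match something (the exact count will vary as the page changes):
> length(data)  # should be > 0 - one JSON string per project div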
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure what bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
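A couple of hedged ways to poke around that structure (both base R; the URL pattern below is just a guess at what you are after):
> str(jdata, max.level = 2)        # summarise only the top two levels of nesting
> flat <- unlist(jdata)            # flatten the nested list to a named vector
> flat[grepl("^https?://", flat)]  # keep only the URL-shaped values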
Repeat over data[[i]] to get all the links; a sketch follows below.
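A minimal sketch of that loop, assuming every div carries a well-formed data-project attribute (sapply is just one way to write it; the field path comes straight from the step above):
> all_links <- sapply(data, function(d) fromJSON(d)$urls$web$project, USE.NAMES = FALSE)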
Note that you should check the site's T&Cs to make sure you are allowed to do this, and also see if there's an API you should really be using.
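If you want a programmatic first pass at that, the third-party robotstxt package (my suggestion, not part of the original answer) can check a path against the site's robots.txt:
> library(robotstxt)
> paths_allowed("https://www.kickstarter.com/discover/advanced")  # TRUE if robots.txt permits fetching this path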
Answered By - Spacedman