Issue
I tried to scrape Kickstarter. However, I do not get any results when I try to get the URLs that refer to the projects.
This should be one of the results:
and this is my code:
Code:
main.page1 <- read_html(x = "https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
urls1 <- main.page1 %>%                          # feed `main.page1` to the next step
  html_nodes(".block.img-placeholder.w100p") %>% # get the CSS nodes
  html_attr("href")                              # extract the URLs
Does anyone see where I went wrong?
Solution
First declare all the packages you use - I had to go searching to realise I needed rvest:
> library(rvest)
> library(dplyr)
Get your HTML:
> main.page1 <- read_html(x ="https://www.kickstarter.com/discover/advanced?category_id=1&sort=end_date&seed=2498921&page=1")
As that stands, the data for each project is stashed in a data-project attribute in a bunch of divs. In the browser, some Javascript (I suspect built using the React framework) would normally fill in the other divs, fetch the images, format the links and so on. But you have just grabbed the raw HTML, so none of that is available. The raw data is, though... So...
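You can confirm that the selector from the question finds nothing in the raw HTML (a quick check of my own, not shown in the original answer; the question itself reports getting no results):
> main.page1 %>% html_nodes(".block.img-placeholder.w100p") %>% length()  # 0 - those nodes are built later by the Javascript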
The relevant divs appear to have the class "react-disc-landing", so this gets the data as text strings:
> data = main.page1 %>%
    html_nodes("div.react-disc-landing") %>%
    html_attr("data-project")
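A quick sanity check that this selector did match something (the exact count will vary as the page changes):
> length(data)  # should be > 0 - one JSON string per project div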
These things appear to be JSON strings:
> substr(data[[1]],1,80)
[1] "{\"id\":208460273,\"photo\":{\"key\":\"assets/017/007/465/9b725fdf5ba1ee63e8987e26a1d33"
So let's use the rjson package to decode the first one:
> library(rjson)
> jdata = fromJSON(data[[1]])
jdata is now a very complex nested list. Use str(jdata) to see what is in it. I'm not sure what bit of it you want, but maybe this URL:
> jdata$urls$web$project
[1] "https://www.kickstarter.com/projects/1513052868/sense-of-place-by-jose-davila"
If not, the URL you want must be in that structure somewhere.
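A couple of hedged ways to poke around that structure (both base R; the URL pattern below is just a guess at what you are after):
> str(jdata, max.level = 2)        # summarise only the top two levels of nesting
> flat <- unlist(jdata)            # flatten the nested list to a named vector
> flat[grepl("^https?://", flat)]  # keep only the URL-shaped values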
Repeat over data[[i]] to get all the links; a sketch follows below.
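A minimal sketch of that loop, assuming every div carries a well-formed data-project attribute (sapply is just one way to write it; the field path comes straight from the step above):
> all_links <- sapply(data, function(d) fromJSON(d)$urls$web$project, USE.NAMES = FALSE)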
Note that you should check the site's T&Cs to make sure you are allowed to do this, and also see if there's an API you should really be using.
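If you want a programmatic first pass at that, the third-party robotstxt package (my suggestion, not part of the original answer) can check a path against the site's robots.txt:
> library(robotstxt)
> paths_allowed("https://www.kickstarter.com/discover/advanced")  # TRUE if robots.txt permits fetching this path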
Answered By - Spacedman