Tuesday, March 8, 2022

[FIXED] Scraping Kickstarter Project Page with Python

Labels: beautifulsoup, kickstarter, python-3.x, selenium, web-scraping

Issue

For over a year I have been using the code below to scrape certain Kickstarter pages as part of my daily job. Nothing malicious or ill-intended; I just need to get some information from the page to help the project creator.

But for the past 4 to 6 months, Kickstarter has had some sort of blocker in place that prevents me from reaching and scraping the actual page. All I get back is:

Backer or bot? Complete this security check to prove that you’re a human. Once you’ve passed this page, you might need to navigate away from your current screen on Kickstarter to refresh and move on. To avoid seeing this page again, double-check that JavaScript and cookies are enabled on your web browser and that you’re not blocking them from loading with an extension (e.g., ad blockers).

Can anyone think of a way to get past this check and actually land on the page? Any input would be greatly appreciated.

import os
import sys
import requests
import time
import urllib
import urllib.request
import shutil
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from csv import writer
from shutil import copyfile

print('What is the project URL?')
urlInp = input()

elClass = "rte__content"

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

driver.get(urlInp)
time.sleep(2)
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(html, 'lxml')
ele = soup.find('div', {'class': elClass})

print(soup)
quit()

Solution

From your script, it looks like you're trying to get the project story.

Selenium is great for GUI testing, but it announces who it is to the website to help prevent DoS attacks. The Selenium docs have more on this if you're curious. The way I see it, these sites are going out of their way to block GUI automation for a reason. They have lots of clever people working on it, so it's going to be an uphill battle trying to beat them.

As a better alternative, have you thought about using the requests library? It lets you simulate the calls without needing a browser at all.

I've looked in devtools and there's even an API that can get the story information for you. You need a CSRF token, and you need to POST some data (the project slug, which is already available in your URL). This will run significantly faster than Selenium and allow you to do a lot more.
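To make the token step concrete on its own, here is a minimal sketch that pulls the token out of a page's HTML using only the standard library. The names `CsrfTokenParser` and `extract_csrf_token` are hypothetical helpers, not part of any library, and the check for the "Backer or bot?" phrase assumes the security-check page contains that text and no token.

```python
from html.parser import HTMLParser


class CsrfTokenParser(HTMLParser):
    """Collects the content of <meta name="csrf-token" content="..."> if present."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "meta" and self.token is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "csrf-token":
                self.token = attr_map.get("content")


def extract_csrf_token(page_html):
    """Return the CSRF token from page_html, or raise ValueError if it is missing."""
    parser = CsrfTokenParser()
    parser.feed(page_html)
    if parser.token is None:
        # Assumption: the security-check page contains this phrase and no token.
        if "Backer or bot?" in page_html:
            raise ValueError("got the security-check page instead of the project page")
        raise ValueError("no csrf-token meta tag found")
    return parser.token
```

Failing with a clear message is a design choice here: indexing the first match directly raises a bare IndexError when the project page was not returned, which is harder to diagnose.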

This is some code I've put together for you. I picked a random Kickstarter page and hard-coded it into this demo:

import requests
from bs4 import BeautifulSoup
from lxml import html

urlInp = 'https://www.kickstarter.com/projects/iamlunasol/soft-like-mochi-enamel-pins?ref=section-homepage-featured-project'

# start a session - this stores cookies between requests
s = requests.Session()

# load the project page to get the cookies and the CSRF token
landing = s.get(urlInp)
page = html.fromstring(landing.content)
csrf = page.xpath('//meta[@name="csrf-token"]')[0].get('content')
headers = {'x-csrf-token': csrf}


#hit the api with the data
graphslug = urlInp.split("projects/")[1]
graphslug = graphslug.split("?")[0]
graphData= [{
        "operationName": "Campaign",
        "variables": {
            "slug": graphslug
        },
        "query": "query Campaign($slug: String!) {\n  project(slug: $slug) {\n    id\n    isSharingProjectBudget\n    risks\n    showRisksTab\n    story(assetWidth: 680)\n    currency\n    spreadsheet {\n      displayMode\n      public\n      url\n      data {\n        name\n        value\n        phase\n        rowNum\n        __typename\n      }\n      dataLastUpdatedAt\n      __typename\n    }\n    environmentalCommitments {\n      id\n      commitmentCategory\n      description\n      __typename\n    }\n    __typename\n  }\n}\n"
    }]

response = s.post("https://www.kickstarter.com/graph", json=graphData, headers=headers)

#process the response
graph_json = response.json()
story = graph_json[0]['data']['project']['story']
soup = BeautifulSoup(story, 'lxml')
print(soup)

The first few lines of the output are:

<html><body><p>Hi! I'm Felice Regina (<a href="https://www.instagram.com/iamlunasol/" rel="noopener" target="_blank">@iamlunasol</a> on Instagram) but everyone just calls me Luna! I'm an independent illustrator and pin designer! I've run many successful 
Kickstarter campaigns for enamel pins over the past few years. This campaign will help put new hard enamel pin designs into production.</p>
<p>Pledging ensures that the pins get produced, discounts when you purchase multiple pins, plus any freebies that we may unlock. If the campaign is successful, any extra pins will be sold at $12 + shipping in my <a href="https://shopiamlunasol.com/" rel="noopener" target="_blank">web store</a>.</p>

This ties back to the story seen in the JSON response in devtools (the Preview tab is good for this).

And finally, if you're looking to adapt this to use other queries, you can get an idea of the JSON data to send from the request payload shown in the Headers tab.
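As a rough sketch of adapting the payload, the two helpers below rebuild the slug from the URL and assemble a trimmed query that asks only for selected fields. `project_slug` and `build_payload` are hypothetical names. The field names come from the query used earlier, and whether the endpoint accepts a shortened query like this is an assumption you would need to confirm against the request payload in devtools.

```python
from urllib.parse import urlparse


def project_slug(url):
    """Return the 'creator/project' slug from a Kickstarter project URL."""
    path = urlparse(url).path  # the ?ref=... query string is not part of the path
    prefix = "/projects/"
    if not path.startswith(prefix):
        raise ValueError("not a project URL: " + url)
    return path[len(prefix):].strip("/")


def build_payload(slug, fields=("story(assetWidth: 680)", "risks")):
    """Build a one-operation payload in the same shape as the request above.

    Assumption: the endpoint accepts a query with fewer fields than the
    one captured from devtools.
    """
    selection = "\n    ".join(fields)
    query = (
        "query Campaign($slug: String!) {\n"
        "  project(slug: $slug) {\n"
        "    " + selection + "\n"
        "  }\n"
        "}\n"
    )
    return [{"operationName": "Campaign", "variables": {"slug": slug}, "query": query}]
```

The result of `build_payload(project_slug(urlInp))` could then be passed as the `json=` argument of the same `s.post(...)` call, with the CSRF header unchanged.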

Answered By - RichEdwards

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
