Issue
I am trying to pull specific data for some projects listed on kickstarter.com.
Kickstarter.com uses GraphQL, and I am trying to replicate its API call. The request works with the Python requests library, but when I send it through a Scrapy request it keeps returning a 403 error.
I assume the problem is the content-type header, but I could not find the correct value to use. Note that the exact same headers and payload work with the plain requests library.
def start_requests(self):
    url = "https://www.kickstarter.com/graph"
    payload = json.dumps([
        {
            "operationName": "Campaign",
            "variables": {
                "slug": "leightonconnor/akashic-titan-blue-bolt"
            },
            "query": "query Campaign($slug: String!) {\n project(slug: $slug) {\n id\n isSharingProjectBudget\n risks\n story(assetWidth: 680)\n currency\n spreadsheet {\n displayMode\n public\n url\n data {\n name\n value\n phase\n rowNum\n __typename\n }\n dataLastUpdatedAt\n __typename\n }\n environmentalCommitments {\n id\n commitmentCategory\n description\n __typename\n }\n __typename\n }\n}\n"
        }
    ])
    headers = {
        'content-type': 'application/json',
        'x-csrf-token': 'AZsT67Z9s-LHZt6ZJXLSQWJlNdd7biKz2XDfFMkcYMZrNufH1OWoFhNBlXIvxCrxKRzV6l8bG_Z6QlcRoYMe_g',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        'cookie': '_ksr_session=fc2U7qXXaRN91foNiE53NyU3s181NZO0Ll57xPkYxZ5iyUNgus35a0HwsPBTfViBY%2ByAKbtpRirAVLxOGKzG%2BYMOmsLRBPujZep%2Fca%2B1%2FXzW3xX56VXkh5w6ItYhIctEFifQQhw3rTmvoljyHw%3D%3D--4pK6xBEgChjqgmte--LH4Q1qSnhU%2FYX9JgTzGuSQ%3D%3D;'
    }
    print('..ok')
    yield scrapy.Request(url, method="POST", headers=headers, body=payload, callback=self.parse_project)
Returns:
2022-02-23 07:06:55 [scrapy.core.engine] DEBUG: Crawled (403) <POST https://www.kickstarter.com/graph> (referer: None)
2022-02-23 07:06:55 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.kickstarter.com/graph>: HTTP status code is not handled or not allowed
Code in Python Requests (works):
import requests
import json
url = "https://www.kickstarter.com/graph"
payload = json.dumps([
    {
        "operationName": "Campaign",
        "variables": {
            "slug": "leightonconnor/akashic-titan-blue-bolt"
        },
        "query": "query Campaign($slug: String!) {\n project(slug: $slug) {\n id\n isSharingProjectBudget\n risks\n story(assetWidth: 680)\n currency\n spreadsheet {\n displayMode\n public\n url\n data {\n name\n value\n phase\n rowNum\n __typename\n }\n dataLastUpdatedAt\n __typename\n }\n environmentalCommitments {\n id\n commitmentCategory\n description\n __typename\n }\n __typename\n }\n}\n"
    }
])
headers = {
    'content-type': 'application/json',
    'x-csrf-token': 'AZsT67Z9s-LHZt6ZJXLSQWJlNdd7biKz2XDfFMkcYMZrNufH1OWoFhNBlXIvxCrxKRzV6l8bG_Z6QlcRoYMe_g',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    'cookie': '_ksr_session=fc2U7qXXaRN91foNiE53NyU3s181NZO0Ll57xPkYxZ5iyUNgus35a0HwsPBTfViBY%2ByAKbtpRirAVLxOGKzG%2BYMOmsLRBPujZep%2Fca%2B1%2FXzW3xX56VXkh5w6ItYhIctEFifQQhw3rTmvoljyHw%3D%3D--4pK6xBEgChjqgmte--LH4Q1qSnhU%2FYX9JgTzGuSQ%3D%3D;'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.status_code)
print(response.json()[0]['data']['project']['risks'])
Solution
Here's how it worked for me:
- Open the page you want to scrape.
- Open the Network tab in the browser's developer tools.
- Find the GraphQL request that contains the information you want.
- Right-click it and choose Copy > Copy as cURL (bash). (This assumes you are using Chrome; I think other browsers offer it too, but I use Chrome.)
- Go to curl2scrapy and paste your cURL command. It will give you the headers and payload.
- Before you run it, replace every \n in the query with \\n. If the payload is pasted as an ordinary Python string literal, a single \n turns into a real line break, which is not valid inside a JSON string.
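To see why the last step matters, here is a minimal sketch using only the standard json module. The shortened query and the is_valid_json helper are made up for illustration and are not part of the Kickstarter request above; the assumption is that the payload ends up in your spider as a hand-written Python string.

```python
import json

# Hypothetical, shortened query used only for illustration.
# In a normal Python string, "\n" becomes a real line break,
# which JSON does not allow inside a string value.
broken_body = '[{"query": "query Campaign {\n project { id }\n}"}]'

# Doubling the backslash keeps the two characters "\" and "n" in the body,
# which is the JSON escape sequence for a newline.
fixed_body = '[{"query": "query Campaign {\\n project { id }\\n}"}]'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (illustrative helper)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Building the body with json.dumps from a Python object does the escaping for you.
dumped_body = json.dumps([{"query": "query Campaign {\n project { id }\n}"}])

print(is_valid_json(broken_body))
print(is_valid_json(fixed_body))
print(is_valid_json(dumped_body))
```

If you would rather not edit the string by hand, building the payload as a Python list and passing it through json.dumps, as in the question's own code, should produce the same escaped body.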
Answered By - zaki98