Issue
I'm trying to scrape data from BoardGameGeek (BGG) for a project, but after the 20th page the site requires you to log in. I'm following this guide and have checked out some others that use Scrapy, but they were written before BGG started requiring a login past a certain page. I can't figure out how to get Scrapy's request functions to work.
I'm trying to use Scrapy's Request and FormRequest.from_response (https://docs.scrapy.org/en/latest/topics/request-response.html) in a Spider, as shown here:
    class BGGSpider(Spider):
        name = "bgg"
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}

        def start_requests(self):
            yield scrapy.Request(
                url='https://boardgamegeek.com/login/',
                callback=self.login
            )

        def login(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={
                    'username': 'myname',
                    'password': 'mypassword',
                },
                callback=self.parse
            )

        def parse(self, response):
            url = "https://www.boardgamegeek.com/browse/boardgame/page/"
            for index in range(1):
                yield Request(url=url+str(index+1), callback=self.parse_deeper, headers=self.headers, dont_filter=True)
It fails with:

        raise ValueError(f"No <form> element found in {response}")
    ValueError: No <form> element found in <200 https://boardgamegeek.com:443/login>

I checked the login page and it looks like there is a form, but I don't know how to make the request target it specifically, or why it can't be found as a form (maybe it is embedded via JavaScript?). Help would be appreciated, thank you in advance!
Solution
You are right: the page https://boardgamegeek.com/login/ does not contain the login form in its HTML. The real form is loaded via JavaScript. You can see the traffic of the website in the Network tab of Chrome's inspector, which really helps me in my work at https://bitmaker.la.
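FormRequest.from_response only sees the HTML the server sent back, so it raises the ValueError from the question when that HTML has no <form> tag. A quick way to confirm this for any page is to scan the raw HTML yourself. The sketch below uses only the standard library; the class name, the function name, and both sample page bodies are hypothetical and are not taken from the BGG site.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Counts <form> start tags in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.form_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "form":
            self.form_count += 1


def has_form(html: str) -> bool:
    """Return True if the raw HTML contains at least one <form> tag."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return finder.form_count > 0


# Hypothetical page bodies, invented for illustration only.
js_rendered_page = (
    "<html><body><div id='app'></div>"
    "<script src='app.js'></script></body></html>"
)
static_form_page = (
    "<html><body><form action='/login' method='post'>"
    "<input name='username'></form></body></html>"
)

print(has_form(js_rendered_page))  # False: the form would only appear after JavaScript runs
print(has_form(static_form_page))  # True: from_response would have a form to fill in
```

If the check is False for the body Scrapy downloaded, the form is being built in the browser, and the credentials have to be sent to whatever endpoint the page's JavaScript calls.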
The real URL to POST your login to is https://boardgamegeek.com/login/api/v1. You will receive a 204 status; after that you can go to https://boardgamegeek.com/?rnd=0mcmt and start scraping.
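To make the shape of that POST concrete, here is a minimal sketch of the JSON body and of a success check. The endpoint and the nested "credentials" layout come from this answer. The helper names are hypothetical, and treating only 204 as success is an assumption based on the status mentioned above.

```python
import json

# Endpoint as given in this answer.
LOGIN_API_URL = "https://boardgamegeek.com/login/api/v1"


def build_login_body(username: str, password: str) -> str:
    """Serialize the credentials in the nested JSON shape used by this answer."""
    return json.dumps({"credentials": {"username": username, "password": password}})


def login_succeeded(status: int) -> bool:
    """Assumption: only 204 (No Content) counts as a successful login."""
    return status == 204


body = build_login_body("myname", "mypassword")
print(body)
print(login_succeeded(204))
```

Inside a Scrapy callback the same check could be applied to response.status before requesting any page that needs a login.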
Here is a spider version of the explanation:
    import json

    import scrapy
    from scrapy import Spider, Request


    class BGGSpider(Spider):
        name = "bgg"

        def start_requests(self):
            # Load the login page first, as a browser would.
            yield scrapy.Request(
                url='https://boardgamegeek.com/login/',
                callback=self.login
            )

        def login(self, response):
            # The form is rendered by JavaScript, so POST the credentials
            # as JSON straight to the login API instead of using from_response.
            yield scrapy.Request(
                url='https://boardgamegeek.com/login/api/v1',
                method="POST",
                callback=self.parse,
                dont_filter=True,
                body=json.dumps({"credentials": {"username": "username", "password": "password"}}),
                headers={
                    'authority': 'boardgamegeek.com',
                    'content-type': 'application/json',
                    'origin': 'https://boardgamegeek.com',
                    'referer': 'https://boardgamegeek.com/login',
                    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
                }
            )

        def parse(self, response):
            # Called with the response to the login POST (a 204 on success).
            url = "https://boardgamegeek.com/?rnd=0mcmt"
            yield Request(url=url, callback=self.parse_deeper)

        def parse_deeper(self, response):
            # Logged in at this point; start scraping here.
            print("we passed the login")
BTW, the headers are important.
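This answer does not say which of the headers the server actually checks. One way to find out is to rebuild the login request outside Scrapy and drop headers one at a time. The sketch below only builds the request with the standard library's urllib.request and does not send it. The header values are copied from the spider above, and the constant and function names are hypothetical.

```python
import urllib.request

LOGIN_API_URL = "https://boardgamegeek.com/login/api/v1"

# Header values copied from the spider in this answer.
LOGIN_HEADERS = {
    "authority": "boardgamegeek.com",
    "content-type": "application/json",
    "origin": "https://boardgamegeek.com",
    "referer": "https://boardgamegeek.com/login",
    "user-agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    ),
}


def make_login_request(body: bytes, headers: dict = LOGIN_HEADERS) -> urllib.request.Request:
    """Build (but do not send) the login POST with the given headers."""
    return urllib.request.Request(
        LOGIN_API_URL, data=body, headers=dict(headers), method="POST"
    )


request = make_login_request(
    b'{"credentials": {"username": "username", "password": "password"}}'
)
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    # urllib normalizes header names, e.g. "content-type" becomes "Content-type".
    print(f"{name}: {value}")
```

Passing a reduced copy of LOGIN_HEADERS as the second argument, sending the result with urllib.request.urlopen, and comparing the status codes would show which headers are required.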
Answered By - Jgaldos