Wednesday, January 24, 2024

[FIXED] Scrapy can't Login to BoardGameGeek


Issue

I'm trying to scrape data from BGG for a project, but after the 20th page you're required to log in. I'm following this guide and checked out some others that used Scrapy, but they were written before the site required a login past a certain page. I can't figure out how to get Scrapy's request functions to work.

I'm trying to use Scrapy's Request and FormRequest.from_response in a Spider (https://docs.scrapy.org/en/latest/topics/request-response.html), as shown here:

import scrapy
from scrapy import Spider, Request

class BGGSpider(Spider):
    name = "bgg"
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}

    def start_requests(self):
        yield scrapy.Request(
            url='https://boardgamegeek.com/login/',
            callback=self.login
        )

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata = {
                'username': 'myname', 
                'password': 'mypassword',
            },
            callback=self.parse
        )

    def parse(self, response):
        url = "https://www.boardgamegeek.com/browse/boardgame/page/"
        for index in range(1):
            yield Request(url=url+str(index+1), callback=self.parse_deeper, headers=self.headers, dont_filter=True)

It raises the following error:

    raise ValueError(f"No <form> element found in {response}")
ValueError: No <form> element found in <200 https://boardgamegeek.com:443/login>

I checked the login portal and it looks like there is a form, but I don't know how to get the request to target it specifically, or why it can't be found as a form (embedded via JavaScript, maybe?). Help would be appreciated. Thank you in advance!


Solution

You are right: the page https://boardgamegeek.com/login/ does not contain the login form. The real form is loaded via JavaScript. You can see the site's traffic in the Network tab of Chrome's developer tools (Inspect); it really helps me in my work at https://bitmaker.la
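One way to confirm that a form is missing from the served HTML (and is therefore injected later by JavaScript) is to count the form tags in the raw response body. The following is a minimal sketch using only the standard library. The `FormCounter` class, the `count_forms` helper, and both sample HTML strings are made up for illustration and are not the actual BoardGameGeek markup. In a Scrapy callback, `response.text` would supply the HTML.

```python
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Hypothetical helper: counts <form> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.forms = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "form":
            self.forms += 1


def count_forms(html):
    """Return the number of <form> elements found in the raw HTML string."""
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.forms


# Illustrative pages (assumptions): one with a server-rendered form,
# one that only ships a mount point for a JavaScript application.
static_page = "<html><body><form action='/login'><input name='username'></form></body></html>"
js_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(count_forms(static_page))  # 1
print(count_forms(js_page))      # 0
```

If the count is zero for the page Scrapy actually downloaded, FormRequest.from_response has nothing to fill in, which matches the ValueError in the question.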

The real URL to POST your login to is https://boardgamegeek.com/login/api/v1. You will receive a 204 status, after which you can go to https://boardgamegeek.com/?rnd=0mcmt and start scraping.

Here is a spider version of the explanation:

import scrapy
from scrapy import Spider, Request
import json

class BGGSpider(Spider):
    name = "bgg"

    def start_requests(self):
        yield scrapy.Request(
            url='https://boardgamegeek.com/login/',
            callback=self.login
        )

    def login(self, response):
        yield scrapy.Request(
            url='https://boardgamegeek.com/login/api/v1',
            method="POST",
            callback=self.parse,
            dont_filter=True,
            body=json.dumps({"credentials": {"username": "username", "password": "password"}}),
            headers={
                'authority': 'boardgamegeek.com',
                'content-type': 'application/json',
                'origin': 'https://boardgamegeek.com',
                'referer': 'https://boardgamegeek.com/login',
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
            }
        )

    def parse(self, response):
        url = "https://boardgamegeek.com/?rnd=0mcmt"
        yield Request(url=url, callback=self.parse_deeper)

    def parse_deeper(self, response):
        print("we passed the login")

BTW, the headers are important.
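The spider above moves on to scraping without checking that the login worked. As a hedged sketch, the JSON body and the status check could be factored into two small helpers. The names `build_login_body` and `login_succeeded` are hypothetical, and treating only 204 as success is an assumption based on the status reported above.

```python
import json

# Login endpoint taken from the answer above.
LOGIN_URL = "https://boardgamegeek.com/login/api/v1"


def build_login_body(username, password):
    """Serialize credentials in the JSON shape used by the spider above."""
    return json.dumps({"credentials": {"username": username, "password": password}})


def login_succeeded(status):
    """Assumption: the API answers 204 on success, so anything else is a failure."""
    return status == 204
```

In the spider, `body=build_login_body("username", "password")` would replace the inline json.dumps call, and the `parse` callback could return early when `login_succeeded(response.status)` is false. Note that Scrapy by default does not pass non-2xx responses to the callback, so this check mainly guards against an unexpected 2xx response.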

Answered By - Jgaldos

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
