Issue
I am trying to create a crawler that first logs in to the website and then continues crawling other pages.
The website is https://login.globo.com/login/6668?url=https://valor.globo.com/
After fiddling around a bit, I came up with this (I have already imported the necessary libraries):
class CrawlSite(scrapy.Spider):
    name = 'WebCrawl'
    start_urls = ('https://login.globo.com/login/6668?url=https://valor.globo.com/')

    def login_valor(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'password': 'password.',
                'login': 'username'},
            callback=self.scrape_links)

    def scrape_links(self):
        urls = ['https://valor.globo.com/impresso/20200501/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_normal)
After some reading, I understood that I should try to find the site's authentication method, but I've had no luck so far.
The rest (scraping the links from the pages) is working fine.
Thanks!
Solution
A FormRequest.from_response will not work in this case, as there is no form visible to Scrapy (it is loaded dynamically). Either you render the page with something like Splash (a sketch of that alternative follows the main example below), or you create the POST request yourself. You can figure out how the login works by opening Developer Tools in Chrome and checking the 'Network' tab while you log in manually. Based on that, I think the code below should work (I can't really test it because I don't have a login for the website):
import scrapy
import json


class CrawlSite(scrapy.Spider):
    name = 'WebCrawl'
    start_urls = ['https://login.globo.com/login/6668?url=https://valor.globo.com/']
    login_url = 'https://login.globo.com/api/authentication'
    username = 'test_username'
    password = 'test_password'
    headers = {
        'authority': 'login.globo.com',
        'referer': 'https://login.globo.com/login/6668?url=https://valor.globo.com/',
        'origin': 'https://login.globo.com',
        'content-type': 'application/json; charset=UTF-8',
        'accept': 'application/json, text/javascript',
        'accept-language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4,fr;q=0.3,it;q=0.2',
    }

    def parse(self, response):
        # Recreate the JSON POST that the site sends when you log in manually
        payload = {
            'payload': {
                'email': self.username,
                'password': self.password,
                'serviceId': 6668,  # better to get this value from the HTML (see the sketch below)
            },
            'captcha': ''
        }
        yield scrapy.Request(
            url=self.login_url,
            body=json.dumps(payload),
            method='POST',
            headers=self.headers,
            callback=self.scrape_links
        )

    def scrape_links(self, response):
        urls = ['https://valor.globo.com/impresso/20200501/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_normal)

    def parse_normal(self, response):
        pass
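A note on the serviceId comment above: the id also appears in the login URL itself (/login/6668?url=...), so one way to avoid hard-coding it is to pull it out of response.url inside parse. A minimal sketch (extract_service_id is a hypothetical helper, and the regex assumes the id always sits between /login/ and the query string):

import re

def extract_service_id(url, default=6668):
    # Hypothetical helper: grab the numeric id between /login/ and the
    # query string, e.g. .../login/6668?url=... -> 6668
    match = re.search(r'/login/(\d+)', url)
    return int(match.group(1)) if match else default

Inside parse you would then write 'serviceId': extract_service_id(response.url) instead of the hard-coded 6668. Also note that Scrapy's cookie middleware is enabled by default, so the session cookie set by the authentication response is carried along on the requests yielded from scrape_links without any extra work.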
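If you prefer the Splash route instead, the sketch below shows roughly what it could look like with the scrapy-splash package (this assumes a Splash instance is running and the scrapy-splash middlewares are enabled in settings.py as described in its README; whether from_response then finds the form depends on how the rendered page builds it, so treat this purely as a sketch):

import scrapy
from scrapy import FormRequest
from scrapy_splash import SplashRequest


class CrawlSiteSplash(scrapy.Spider):
    name = 'WebCrawlSplash'

    def start_requests(self):
        # Render the login page with Splash so the dynamically loaded
        # form is present in the response
        yield SplashRequest(
            'https://login.globo.com/login/6668?url=https://valor.globo.com/',
            callback=self.login_valor,
            args={'wait': 2},  # give the JavaScript time to build the form
        )

    def login_valor(self, response):
        return FormRequest.from_response(
            response,
            formdata={'login': 'username', 'password': 'password'},
            callback=self.scrape_links,
        )

    def scrape_links(self, response):
        for url in ['https://valor.globo.com/impresso/20200501/']:
            yield scrapy.Request(url, callback=self.parse_normal)

    def parse_normal(self, response):
        pass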
Answered By - Wim Hermans