Issue
I am trying to create a crawler that first logs in to the website and then continues crawling other pages.
The website is https://login.globo.com/login/6668?url=https://valor.globo.com/
After fiddling around a bit, I came up with this (I have already imported the necessary libraries):
class CrawlSite(scrapy.Spider):
    name = 'WebCrawl'
    start_urls = ('https://login.globo.com/login/6668?url=https://valor.globo.com/')

    def login_valor(self, response):
        return FormRequest.from_response(
            response,
            formdata={
                'password': 'password.',
                'login': 'username'},
            callback=self.scrape_links)

    def scrape_links(self):
        urls = ['https://valor.globo.com/impresso/20200501/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_normal)
After some reading, I understood that I should try to find the site's authentication method, but I've had no luck so far.
The rest (scraping the links from the pages) is working fine.
Thanks!
Solution
A FormRequest.from_response will not work in this case, as there is no form visible to Scrapy (it is loaded dynamically). Either you render the page with something like Splash (a sketch of that alternative follows the main example below), or you create the POST request yourself. You can figure out how the login works by opening Developer Tools in Chrome and checking the 'Network' tab while you log in manually. Based on that, I think the code below should work (I can't really test it because I don't have a login for the website):
import scrapy
import json


class CrawlSite(scrapy.Spider):
    name = 'WebCrawl'
    start_urls = ['https://login.globo.com/login/6668?url=https://valor.globo.com/']
    login_url = 'https://login.globo.com/api/authentication'
    username = 'test_username'
    password = 'test_password'
    headers = {
        'authority': 'login.globo.com',
        'referer': 'https://login.globo.com/login/6668?url=https://valor.globo.com/',
        'origin': 'https://login.globo.com',
        'content-type': 'application/json; charset=UTF-8',
        'accept': 'application/json, text/javascript',
        'accept-language': 'en-GB,en;q=0.9,nl-BE;q=0.8,nl;q=0.7,ro-RO;q=0.6,ro;q=0.5,en-US;q=0.4,fr;q=0.3,it;q=0.2',
    }

    def parse(self, response):
        # Recreate the JSON POST that the site sends when you log in manually
        payload = {
            'payload': {
                'email': self.username,
                'password': self.password,
                'serviceId': 6668,  # better to get this value from the HTML (see the sketch below)
            },
            'captcha': ''
        }
        yield scrapy.Request(
            url=self.login_url,
            body=json.dumps(payload),
            method='POST',
            headers=self.headers,
            callback=self.scrape_links
        )

    def scrape_links(self, response):
        urls = ['https://valor.globo.com/impresso/20200501/']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_normal)

    def parse_normal(self, response):
        pass
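A note on the serviceId comment above: the id also appears in the login URL itself (/login/6668?url=...), so one way to avoid hard-coding it is to pull it out of response.url inside parse. A minimal sketch (extract_service_id is a hypothetical helper, and the regex assumes the id always sits between /login/ and the query string):

import re

def extract_service_id(url, default=6668):
    # Hypothetical helper: grab the numeric id between /login/ and the
    # query string, e.g. .../login/6668?url=... -> 6668
    match = re.search(r'/login/(\d+)', url)
    return int(match.group(1)) if match else default

Inside parse you would then write 'serviceId': extract_service_id(response.url) instead of the hard-coded 6668. Also note that Scrapy's cookie middleware is enabled by default, so the session cookie set by the authentication response is carried along on the requests yielded from scrape_links without any extra work.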
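If you prefer the Splash route instead, the sketch below shows roughly what it could look like with the scrapy-splash package (this assumes a Splash instance is running and the scrapy-splash middlewares are enabled in settings.py as described in its README; whether from_response then finds the form depends on how the rendered page builds it, so treat this purely as a sketch):

import scrapy
from scrapy import FormRequest
from scrapy_splash import SplashRequest


class CrawlSiteSplash(scrapy.Spider):
    name = 'WebCrawlSplash'

    def start_requests(self):
        # Render the login page with Splash so the dynamically loaded
        # form is present in the response
        yield SplashRequest(
            'https://login.globo.com/login/6668?url=https://valor.globo.com/',
            callback=self.login_valor,
            args={'wait': 2},  # give the JavaScript time to build the form
        )

    def login_valor(self, response):
        return FormRequest.from_response(
            response,
            formdata={'login': 'username', 'password': 'password'},
            callback=self.scrape_links,
        )

    def scrape_links(self, response):
        for url in ['https://valor.globo.com/impresso/20200501/']:
            yield scrapy.Request(url, callback=self.parse_normal)

    def parse_normal(self, response):
        pass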
Answered By - Wim Hermans