Issue
Introduction
Since my previous crawler is more or less finished, I need to build a new crawler that only crawls a whole domain for links; I need this for my work. The spider that crawls every link should run once per month.
I'm running Scrapy 2.4.0 and my OS is Ubuntu Server 18.04 LTS.
Problem
The website I have to crawl changed its privacy settings, so you now have to be logged in before you can see the products, which is why my link crawler won't work anymore. I already managed to log in and scrape everything I need, but in that spider the start_urls were given in a CSV file.
Code
import scrapy
from ..items import DuifItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest, Request
from scrapy_splash import SplashRequest

class DuifLinkSpider(CrawlSpider):
    name = 'duiflink'
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    start_urls = ['https://www.duif.nl']
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse_login,
            args={'wait': 3},
            dont_filter=True
        )

    rules = (
        Rule(LinkExtractor(deny='https://www.duif.nl/nl/'), callback='parse_login', follow=True),
    )

    def parse_login(self, response):
        return FormRequest.from_response(
            response,
            formid='login-form',
            formdata={
                'username': 'not real',
                'password': 'login data'},
            clickdata={'type': 'submit'},
            callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]//a/@href')[13]
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.search_links)

    def search_links(self, response):
        link = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]/li/a/@href').get()
        for a in link:
            link = response.url
            yield response.follow(url=link, callback=self.parse_page)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = DuifItem()
            items['Link'] = response.url
            yield items
Unfortunately I can't provide a dummy account where you could try the login yourself, because it's a B2B-service website.
I suspect that my search_links method is wrong.
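For what it's worth, one thing that stands out in search_links is that .get() returns the first matching href as a plain string, so the for loop iterates over its characters rather than over links. A quick pure-Python illustration (the URLs are made up):

```python
# .get() returns a single string (the first match), so looping over it
# walks the string character by character.
link = "/nl/products"          # what response.xpath('...').get() would return
pieces = [a for a in link]     # iterates characters, not links
print(pieces[:4])              # ['/', 'n', 'l', '/']

# Looping over the full list of matches (e.g. what .getall() would return)
# yields one entry per matching <a> element instead.
links = ["/nl/products", "/nl/login"]
print([a for a in links])      # ['/nl/products', '/nl/login']
```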
My planned structure is:
- visit login_page and pass my login credentials
- check whether I am logged in via an XPath expression that tests whether the logout button is present
- if logged in, print 'success'
- starting from an XPath expression, follow links
- for every visited link, check via an XPath expression whether a specific container is present, so the spider knows whether it is a product page or not
- if it is a product page, save the visited link; if it is not a product page, take the next link
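The product-page check in the last two steps can be sketched with a stdlib-only helper (the class string is the one from the spider's XPath; the sample HTML snippets are made up):

```python
from html.parser import HTMLParser

class ProductDetector(HTMLParser):
    """Flags pages containing <div class="product-details col-md-12">."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'div' and dict(attrs).get('class') == 'product-details col-md-12':
            self.found = True

def is_product_page(html):
    detector = ProductDetector()
    detector.feed(html)
    return detector.found

print(is_product_page('<div class="product-details col-md-12">...</div>'))  # True
print(is_product_page('<div class="category-list">...</div>'))              # False
```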
Console output
As you can see, the authentication is working, but the spider doesn't do anything afterwards.
Update
I reworked my code a bit:
import scrapy
from ..items import DuifItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest, Request
from scrapy_splash import SplashRequest

class DuifLinkSpider(CrawlSpider):
    name = 'duiflink'
    allowed_domains = ['duif.nl']
    login_page = 'https://www.duif.nl/login'
    start_urls = ['https://www.duif.nl/']
    custom_settings = {'FEED_EXPORT_FIELDS': ['Link']}

    def start_requests(self):
        yield SplashRequest(
            url=self.login_page,
            callback=self.parse_login,
            args={'wait': 3},
            dont_filter=True
        )

    rules = (
        Rule(LinkExtractor(), callback='parse_login', follow=True),
    )

    def parse_login(self, response):
        return FormRequest.from_response(
            response,
            formid='login-form',
            formdata={
                'username' : 'not real',
                'password' : 'login data'},
            clickdata={'type' : 'submit'},
            callback=self.after_login)

    def after_login(self, response):
        accview = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]//a/@href')[13]
        if accview:
            print('success')
        else:
            print(':(')
        for url in self.start_urls:
            yield response.follow(url=url, callback=self.search_links, dont_filter=True)

    def search_links(self, response):
        # link = response.xpath('//ul[@class="nav navbar-nav navbar-secondary navbar-right"]/li/a/@href')
        link = response.xpath('//a/@href')
        for a in link:
            link = a.get()
            link = 'https://www.duif.nl' + link if link else link
            yield response.follow(url=link, callback=self.parse_page, dont_filter=True)

    def parse_page(self, response):
        productpage = response.xpath('//div[@class="product-details col-md-12"]')
        if not productpage:
            print('No productlink', response.url)
        for a in productpage:
            items = DuifItem()
            items['Link'] = response.url
            yield items
Now I know that I am definitely logged in, but it doesn't follow the "sub"-links. I thought that if I use response.xpath('//a/@href'), it would automatically search the whole DOM for every link.
Below is my new console output.
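That expectation about the XPath itself is right: //a/@href matches every anchor in the document, however deeply nested. A stdlib-only sketch of what such an extraction sees (the sample HTML is made up):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, at any nesting depth."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)

collector = HrefCollector()
collector.feed('<ul><li><a href="/nl/a">A</a></li>'
               '<li><div><a href="/nl/b">B</a></div></li></ul>')
print(collector.hrefs)  # ['/nl/a', '/nl/b'] -- both anchors found
```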
Solution
After you log in, you go back to parsing your start URL. Scrapy filters out duplicate requests by default, so in your case it stops there. You can avoid this by passing dont_filter=True in your request, like this:
yield response.follow(url=url, callback=self.search_links, dont_filter=True)
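To illustrate why this matters: Scrapy's default dupefilter keeps a set of fingerprints of requests it has already scheduled and silently drops anything it has seen before, unless dont_filter=True is set. A toy sketch of that behavior (real fingerprinting hashes the full request; here it is simplified to the bare URL):

```python
class ToyDupeFilter:
    """Simplified stand-in for Scrapy's default dupefilter: drop repeated URLs."""
    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            return True          # bypass filtering entirely
        if url in self.seen:
            return False         # duplicate: silently dropped
        self.seen.add(url)
        return True

f = ToyDupeFilter()
print(f.should_schedule('https://www.duif.nl/'))                    # True: first time seen
print(f.should_schedule('https://www.duif.nl/'))                    # False: duplicate, dropped
print(f.should_schedule('https://www.duif.nl/', dont_filter=True))  # True: filter bypassed
```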
Answered By - Wim Hermans