Issue
I need to make an initial call to a service before starting my scraper (the initial call gives me some cookies and headers), so I decided to use InitSpider and override the init_request method to achieve this. However, I also need to override start_requests to build my links and add some meta values (proxies and so on) to the requests of this specific spider, and here I'm facing a problem: whenever I override start_requests, my crawler no longer calls init_request, so the initialization never runs. The only way I can get init_request to work is to not override start_requests, which is impossible in my case. Any suggestions or possible solutions for my code?
import random

import scrapy
from scrapy.spiders import InitSpider


class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]

    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")
Solution
The Spiders page (generic spiders section) of the official Scrapy docs doesn't mention the InitSpider class you are trying to use. InitSpider comes from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py, written roughly 10 years ago (in those ancient versions of Scrapy, the start_requests method worked completely differently).
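Looking at that file also explains the behavior you are seeing: InitSpider implements its initialization inside start_requests itself, so overriding start_requests removes the machinery that calls init_request. As of the 2.5.0 tag the class is essentially the following (comments mine):

from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class InitSpider(Spider):
    """Base Spider with initialization facilities"""

    def start_requests(self):
        # The "real" start requests are stashed away...
        self._postinit_reqs = super().start_requests()
        # ...and the init request is scheduled instead.
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        # Calling self.initialized() releases the stashed start requests.
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        return self.initialized()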
For that reason I recommend not using the undocumented and probably outdated InitSpider.
On current versions of Scrapy, the required functionality can be implemented using the regular Spider class:
import random

import scrapy


class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule the "real" requests here:
            proxies = ["xyz:0000", "abc:1111"]
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...
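As a usage note, you can try such a spider outside a full Scrapy project via the documented CrawlerProcess API; a minimal sketch, assuming the elided attributes above are filled in:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(SomethingSpider)
process.start()  # blocks until the crawl finishes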
Answered By - Georgiy