Issue
I need to make an initial call to a service before starting my scraper (the initial call gives me some cookies and headers), so I decided to use InitSpider and override the init_request method to achieve this. However, I also need to override start_requests to build my links and add some meta values (proxies and so on) to the requests of this specific spider, and here I'm facing a problem: whenever I override start_requests, my crawler no longer calls init_request, so the initialization never runs. The only way I can get init_request to work is to not override start_requests, which is impossible in my case. Any suggestions or possible solutions for my code?
import random

import scrapy
from scrapy.spiders import InitSpider


class SomethingSpider(InitSpider):
    name = 'something'
    allowed_domains = ['something.something']
    aod_url = "https://something?="
    start_urls = ["id1", "id2", "id3"]

    custom_settings = {
        'DOWNLOAD_FAIL_ON_DATALOSS': False,
        'CONCURRENT_ITEMS': 20,
        'DOWNLOAD_TIMEOUT': 10,
        'CONCURRENT_REQUESTS': 3,
        'COOKIES_ENABLED': True,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 20
    }

    def init_request(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            return self.initialized()
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def start_requests(self):
        print("H4")
        proxies = ["xyz:0000", "abc:1111"]
        for url in self.start_urls:
            yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})

    def parse(self, response):
        try:
            # some processing happens
            yield {
                # some data
            }
        except Exception as err:
            print("Connecting to...")
Solution
The Spiders page (generic spiders section) of the official Scrapy docs doesn't mention the InitSpider class you are trying to use. InitSpider comes from https://github.com/scrapy/scrapy/blob/2.5.0/scrapy/spiders/init.py, written roughly 10 years ago (in those ancient versions of Scrapy, the start_requests method worked completely differently).
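Looking at that file also explains the behavior you are seeing: InitSpider implements its initialization inside start_requests itself, so overriding start_requests removes the machinery that calls init_request. As of the 2.5.0 tag the class is essentially the following (comments mine):

from scrapy.spiders import Spider
from scrapy.utils.spider import iterate_spider_output


class InitSpider(Spider):
    """Base Spider with initialization facilities"""

    def start_requests(self):
        # The "real" start requests are stashed away...
        self._postinit_reqs = super().start_requests()
        # ...and the init request is scheduled instead.
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        # Calling self.initialized() releases the stashed start requests.
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        return self.initialized()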
For that reason I recommend not using the undocumented and probably outdated InitSpider.
On current versions of Scrapy, the required functionality can be implemented using the regular Spider class:
import random

import scrapy


class SomethingSpider(scrapy.Spider):
    ...

    def start_requests(self):
        yield scrapy.Request(url="https://something", callback=self.check_temp_cookie, meta={'proxy': 'someproxy:1111'})

    def check_temp_cookie(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if response.status == 200:
            print("H2")
            # Now the crawling can begin..
            ...
            # Schedule the "real" requests here:
            proxies = ["xyz:0000", "abc:1111"]
            for url in self.start_urls:
                yield scrapy.Request(url=self.aod_url + url, callback=self.parse, meta={'proxy': random.choice(proxies)})
        else:
            print("H3")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        ...
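As a usage note, you can try such a spider outside a full Scrapy project via the documented CrawlerProcess API; a minimal sketch, assuming the elided attributes above are filled in:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(SomethingSpider)
process.start()  # blocks until the crawl finishes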
Answered By - Georgiy