Issue
I am running a very simple Scrapy loop that queries https://api.ipify.org/ several times in a row:
import scrapy

class IpSpider(scrapy.Spider):
    name = "ip"
    n = 0
    use_proxy = True

    def start_requests(self):
        yield scrapy.Request(
            "https://api.ipify.org/",
            callback=self.parse_ip
        )

    def parse_ip(self, response):
        if self.n < 10:
            self.n += 1
            self.logger.info(self.n)
            self.logger.info(response.body)
            yield scrapy.Request(
                "https://api.ipify.org/",
                callback=self.parse_ip
            )
I expect it to log something like:
1
ip
2
ip
3
...
but the log actually looks like:
2022-09-01 08:43:38 [ip] INFO: 1
2022-09-01 08:43:38 [ip] INFO: b'ip'
2022-09-01 08:43:38 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-01 08:43:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Please note that I am using a middleware that routes my requests through a proxy. It looks like:
def process_request(self, request, spider):
    if spider.use_proxy:
        request.meta['proxy'] = 'proxy_ip:proxy_port'
    return None
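For completeness, a downloader middleware like this only runs if it is enabled in the project's settings.py. A typical entry might look like the sketch below (the module path and class name are placeholders, not taken from the question):

```python
# settings.py -- enable the custom proxy middleware.
# "myproject.middlewares.ProxyMiddleware" is a placeholder path;
# use the actual dotted path to your middleware class.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
}
```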
Why does the scraper interrupt the loop?
Solution
Actually, this is normal behavior: by default, Scrapy filters out duplicate requests, and every request after the first one targets the same URL. To allow duplicates, the request should include dont_filter=True:
yield scrapy.Request(
    "https://api.ipify.org/",
    callback=self.parse_ip,
    dont_filter=True
)
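To see why the second request was dropped, the idea behind Scrapy's duplicate filter can be sketched in plain Python: each request URL is reduced to a fingerprint, and a request whose fingerprint has already been seen is discarded unless filtering is bypassed. (This is an illustrative sketch, not Scrapy's actual RFPDupeFilter API.)

```python
import hashlib

class SimpleDupeFilter:
    """Toy duplicate filter, loosely modeled on Scrapy's behavior."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url, dont_filter=False):
        # dont_filter=True bypasses the filter, as in scrapy.Request
        if dont_filter:
            return False
        fp = hashlib.sha1(url.encode()).hexdigest()
        if fp in self.seen:
            return True  # duplicate: the request would be dropped
        self.seen.add(fp)
        return False

f = SimpleDupeFilter()
print(f.request_seen("https://api.ipify.org/"))                    # first request passes
print(f.request_seen("https://api.ipify.org/"))                    # duplicate, dropped
print(f.request_seen("https://api.ipify.org/", dont_filter=True))  # filter bypassed
```

This mirrors what happens in the spider above: the second scrapy.Request to the same URL is silently dropped, so the callback is never called again and the spider closes.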
Answered By - samuel guedon