Issue
I'm working on code that must fetch and process date and time information from a remote JSON API at any time. The code I wrote is as follows:
import scrapy

class TimeSpider(scrapy.Spider):
    name = 'getTime'
    allowed_domains = ['worldtimeapi.org']
    start_urls = ['http://worldtimeapi.org']

    def parse(self, response):
        time_json = 'http://worldtimeapi.org/api/timezone/Asia/Tehran'
        for i in range(5):
            print(i)
            yield scrapy.Request(url=time_json, callback=self.parse_json)

    def parse_json(self, response):
        print(response.json())
And the output it gives is as follows:
0
1
2
3
4
{'abbreviation': '+0430', 'client_ip': '45.136.231.43', 'datetime': '2022-04-22T22:01:44.198723+04:30', 'day_of_week': 5, 'day_of_year': 112, 'dst': True, 'dst_from': '2022-03-21T20:30:00+00:00', 'dst_offset': 3600, 'dst_until': '2022-09-21T19:30:00+00:00', 'raw_offset': 12600, 'timezone': 'Asia/Tehran', 'unixtime': 1650648704, 'utc_datetime': '2022-04-22T17:31:44.198723+00:00', 'utc_offset': '+04:30', 'week_number': 16}
As you can see, the program calls the parse_json function only once, while it should be called on every loop iteration.
Can anyone help me solve this problem?
Solution
The additional requests are being dropped by Scrapy's default duplicates filter. The simplest way to avoid this is to pass the dont_filter argument:
yield scrapy.Request(url=time_json, callback=self.parse_json, dont_filter=True)
From the docs:
dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
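To see why only one of the five identical requests reaches parse_json, here is a minimal sketch of how a scheduler-level duplicates filter behaves. This is a simplified illustration written for this answer, not Scrapy's actual implementation (which fingerprints the full request, not just the URL):

```python
class DupeFilter:
    """Toy model of a duplicates filter: tracks URLs already scheduled."""

    def __init__(self):
        self.seen = set()

    def should_drop(self, url, dont_filter=False):
        if dont_filter:
            return False      # filtering explicitly disabled for this request
        if url in self.seen:
            return True       # identical request already scheduled: drop it
        self.seen.add(url)    # first time we see this URL: let it through
        return False


url = 'http://worldtimeapi.org/api/timezone/Asia/Tehran'

# Default behaviour: only the first of five identical requests survives.
f = DupeFilter()
filtered = [f.should_drop(url) for _ in range(5)]
print(filtered)    # [False, True, True, True, True]

# With dont_filter=True, all five requests survive.
f = DupeFilter()
unfiltered = [f.should_drop(url, dont_filter=True) for _ in range(5)]
print(unfiltered)  # [False, False, False, False, False]
```

This mirrors what happens in the spider above: the first yielded Request is scheduled, the other four are silently dropped as duplicates unless dont_filter=True is set.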
Answered By - stranac