Issue
The first action in my parse method is to extract a dictionary from a JSON string contained in the HTML. I've noticed that I sometimes get an error because the page doesn't render correctly and therefore doesn't contain the JSON string. If I rerun the spider, the same page displays fine and it carries on until another random JSON error.
I'd like to check that I've got the error handling correct:
def parse(self, response):
    json_str = response.xpath("<xpath_to_json>").get()
    try:
        items = json.loads(json_str)["items"]
    except JSONDecodeError:
        return response.follow(url=response.url, callback=self.parse)
    for i in items:
        ...  # do stuff
I'm pretty sure this will work OK, but I wanted to check a couple of things:
- If this hits a 'genuinely bad' page where there is no JSON, will the spider get stuck in a loop, or does Scrapy give up after trying a given URL a certain number of times?
- I've used a return instead of a yield because I don't want to continue running the method. Is this OK?
Any other comments are welcome too!!
Solution
I think a return when you hit a decoding error should be OK in your case, since the scraper is not iterating through scraped results at that point. Normally response.follow and Request filter out duplicate requests, so you need to pass dont_filter=True when calling them to allow repeated requests to the same URL. To cap retries at some number n, it's not the cleanest approach, but you could keep a dictionary of per-URL retry attempt counts as an attribute on the spider (self.retry_count in the code below), increment it every time the URL's response is parsed, and stop when the limit is hit.
import json
from json import JSONDecodeError

import scrapy


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        # initialise the counters once, before any request is yielded,
        # so a later iteration can't wipe counts mid-crawl
        self.retry_count = {url: 0 for url in urls}
        self.retry_limit = 3
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.retry_count[response.url] += 1
        json_str = "{\"items\": 1"  # intentionally trigger a JSON decode error
        print(f'===== RUN {response.url}; Attempt: {self.retry_count[response.url]} =====')
        try:
            items = json.loads(json_str)["items"]
        except JSONDecodeError as ex:
            print("==== ERROR ====")
            if self.retry_count[response.url] == self.retry_limit:
                raise ex
            else:
                return response.follow(url=response.url, callback=self.parse, dont_filter=True)
        self.retry_count[response.url] = 0  # reset attempt count once parse succeeds
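As noted above, spider-level state isn't the cleanest way to do this. A common alternative is to let the retry count travel with the request itself via request.meta. Below is a minimal sketch of that idea, assuming a hypothetical "retry_count" meta key (not a Scrapy built-in) and reusing the asker's <xpath_to_json> placeholder; it also catches TypeError, because response.xpath(...).get() returns None when the JSON string is absent from the page, and json.loads(None) raises TypeError rather than JSONDecodeError.

import json
from json import JSONDecodeError

import scrapy


class MetaRetrySpider(scrapy.Spider):
    name = "test_meta"
    start_urls = ["https://quotes.toscrape.com/page/1/"]
    retry_limit = 3

    def parse(self, response):
        json_str = response.xpath("<xpath_to_json>").get()
        try:
            # json.loads(None) raises TypeError, so catch it too for pages
            # where the JSON string is missing entirely
            items = json.loads(json_str)["items"]
        except (JSONDecodeError, TypeError):
            # the count rides along on the request, so each URL
            # automatically keeps its own tally
            retries = response.meta.get("retry_count", 0)
            if retries >= self.retry_limit:
                self.logger.error("Giving up on %s after %d retries", response.url, retries)
                return
            return response.follow(
                response.url,
                callback=self.parse,
                dont_filter=True,
                meta={"retry_count": retries + 1},
            )
        for i in items:
            ...  # do stuff

Because the count is carried on each request, nothing needs to be reset on success and there is no shared dictionary to keep in sync across URLs.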
Answered By - tax evader