Issue
I have a Scrapy script which looks like this:
main.py
import os
import argparse
import datetime
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from spiders.mySpider import MySpider

parser = argparse.ArgumentParser(description='My Scrapper')
parser.add_argument('-v',
                    '--verbose',
                    help='Verbose mode',
                    action='store_true')
parser.add_argument('-t',
                    '--type',
                    help='Type',
                    type=str)
args = parser.parse_args()

if args.type != 'expected':
    parser.error("Wrong type")

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider, type_arg=args.type)
    process.start()
mySpider.py
from scrapy import Spider
from scrapy.http import Request, FormRequest
import scrapy.exceptions as ScrapyExceptions

class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        # ...
        # Some logic
        # ...
        if condition:
            raise ScrapyExceptions.UsageError(reason="Wrong argument")
When I call parser.error() in the main.py file, my process returns a non-zero exit code as expected. However, when I raise a scrapy.exceptions.UsageError() in the mySpider.py file, I receive a 0 exit code, so the Jenkins pipeline step that runs my script thinks it succeeded and continues with the pipeline execution. I run my script with a python3 main.py --type my_type command.

Why doesn't the script notice that a usage error was raised in the mySpider.py module and return a non-zero exit code?
Solution
After several hours of trying approaches I found this thread. The problem is that Scrapy does not exit with a non-zero code when a scrape fails: exceptions raised inside a spider callback are caught and logged by the engine instead of propagating to the caller, so process.start() returns normally and the process exits with code 0. I managed to fix this behaviour by using the crawler's stats collection.
main.py
import sys  # needed for sys.exit() below

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    process.crawl(MySpider, type_arg=args.type)
    # Keep a reference to the crawler before start(); finished crawlers
    # are removed from process.crawlers once the crawl ends
    crawler = list(process.crawlers)[0]
    process.start()
    # The spider sets this stat when it fails (see mySpider.py below)
    failed = crawler.stats.get_value('custom/failed_job')
    if failed:
        sys.exit(1)
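Note that the crawler reference has to be taken before process.start(), because Scrapy discards finished crawlers from process.crawlers when the crawl ends. An equivalent variant (a sketch using Scrapy's create_crawler(), assuming the same argparse setup as above) hands you the Crawler instance up front instead of fishing it out of process.crawlers:

import sys

if __name__ == "__main__":
    settings = get_project_settings()
    settings['LOG_ENABLED'] = args.verbose
    process = CrawlerProcess(settings=settings)
    # create_crawler() returns the Crawler instance directly
    crawler = process.create_crawler(MySpider)
    process.crawl(crawler, type_arg=args.type)
    process.start()
    # Exit non-zero when the spider flagged the job as failed
    sys.exit(1 if crawler.stats.get_value('custom/failed_job') else 0)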
mySpider.py
class MySpider(Spider):
    name = 'MyScrapper'
    allowed_domains = ['www.webtoscrape.com']
    start_urls = ['http://www.webtoscrape.com/path/to/page.html']

    def parse(self, response):
        # ...
        # Some logic
        # ...
        if condition:
            # Record the failure in the stats before raising, so that
            # main.py can turn it into a non-zero exit code
            self.crawler.stats.set_value('custom/failed_job', 'True')
            raise ScrapyExceptions.UsageError(reason="Wrong argument")
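If any uncaught callback exception should fail the job, not only the branches you remember to flag, the spider can listen for Scrapy's spider_error signal, which fires whenever a callback raises. A minimal sketch of that variant (the handler name _flag_failure is illustrative, not part of the code above):

from scrapy import Spider, signals

class MySpider(Spider):
    name = 'MyScrapper'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # spider_error fires for every exception raised in a callback
        crawler.signals.connect(spider._flag_failure,
                                signal=signals.spider_error)
        return spider

    def _flag_failure(self, failure, response, spider):
        # Illustrative helper: flag the job so main.py exits non-zero
        self.crawler.stats.set_value('custom/failed_job', True)

Scrapy also counts error-level log records in the built-in log_count/ERROR stat, so checking crawler.stats.get_value('log_count/ERROR') in main.py is another way to detect a failed run without a custom stat key.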
Answered By - Luiscri