Issue
I am having trouble accessing some websites that return an HTTP 500 status code along with a correctly formatted HTML page.
I can download the page with Chrome/Firefox, but not with Scrapy.
Scrapy logs:
2020-04-10 15:57:16 [scrapy.core.engine] INFO: Spider opened
2020-04-10 15:57:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-10 15:57:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-04-10 15:57:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 1 times): 500 Internal Server Error
2020-04-10 15:57:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 2 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (failed 3 times): 500 Internal Server Error
2020-04-10 15:57:20 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html> (referer: None)
2020-04-10 15:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html>: HTTP status code is not handled or not allowed
Please see the screenshot below, which shows that the web server returns HTTP 500 along with a web page that renders correctly in Firefox.
The test page is https://www.industrialmotors.com/products/toshiba-motors/where/p/1.html
Thank you, let me know if I should add any details.
Solution
If you want to handle that in only one spider:
from scrapy import Spider

class MySpider(Spider):
    ...
    # Let responses with status 500 reach this spider's callbacks
    handle_httpstatus_list = [500]
If you want to handle that for only one request:
from scrapy import Request
...
    def my_parse_method(self, response):
        ...
        # Let a 500 response to this particular request reach its callback
        yield Request(url='http://example.com', meta={'handle_httpstatus_list': [500]})
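To make it clearer how the two options interact, here is a small stand-alone sketch of the check that produces the "HTTP status code is not handled or not allowed" log line. It is a simplified model for illustration, not Scrapy's actual source: the function name reaches_callback and its parameters are hypothetical, and settings_list merely stands in for the project-wide HTTPERROR_ALLOWED_CODES setting.

```python
from typing import Iterable, Mapping, Optional


def reaches_callback(
    status: int,
    request_meta: Optional[Mapping[str, object]] = None,
    spider_list: Optional[Iterable[int]] = None,
    settings_list: Iterable[int] = (),
) -> bool:
    """Simplified model: does a response with this status reach the callback?"""
    meta = request_meta or {}
    # Successful (2xx) responses always pass through.
    if 200 <= status < 300:
        return True
    if "handle_httpstatus_list" in meta:
        # A per-request list takes precedence over the spider attribute.
        allowed = meta["handle_httpstatus_list"]
    elif spider_list is not None:
        # Spider-level handle_httpstatus_list attribute.
        allowed = spider_list
    else:
        # Hypothetical stand-in for the HTTPERROR_ALLOWED_CODES setting.
        allowed = settings_list
    return status in allowed


# Default behaviour: the 500 response is dropped before the callback.
print(reaches_callback(500))
# Spider attribute allows 500.
print(reaches_callback(500, spider_list=[500]))
# Request meta allows 500.
print(reaches_callback(500, {"handle_httpstatus_list": [500]}))
# Request meta wins over the spider attribute, so 404 is still dropped here.
print(reaches_callback(404, {"handle_httpstatus_list": [500]}, spider_list=[404]))
```

Note that this check is separate from the retry middleware seen earlier in the logs. With default settings, a 500 response is likely still retried before it reaches the callback; setting 'dont_retry': True in the request meta, or removing 500 from the RETRY_HTTP_CODES setting, should avoid those extra requests.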
Answered By - eLRuLL