Issue
I tried to crawl a local HTML file stored on my desktop with the code below, but I get errors before the crawl starts, such as "No such file or directory: '/robots.txt'".
- Is it possible to crawl local HTML files on a local computer (Mac)?
- If so, how should I set parameters such as "allowed_domains" and "start_urls"?
[Scrapy command]
$ scrapy crawl test -o test01.csv
[Scrapy spider]
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = []
    start_urls = ['file:///Users/Name/Desktop/test/test.html']
[Errors]
2018-11-16 01:57:52 [scrapy.core.engine] INFO: Spider opened
2018-11-16 01:57:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-16 01:57:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-16 01:57:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2018-11-16 01:57:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
Solution
When working locally, I never specify allowed_domains.
Try taking that line out and see if it works.
In your error, Scrapy is testing the empty domain you gave it.
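As a sketch of the idea: rather than hand-writing the file:// URL, you can build it from the local path with the standard library's pathlib, which produces a well-formed URI for start_urls (the path below is the one from the question):

```python
from pathlib import Path

# Local HTML file from the question; substitute your own path.
local_file = Path("/Users/Name/Desktop/test/test.html")

# as_uri() turns an absolute path into a proper file:// URL,
# escaping spaces and other special characters for you.
file_url = local_file.as_uri()
print(file_url)  # file:///Users/Name/Desktop/test/test.html
```

The spider would then use `start_urls = [file_url]` with no `allowed_domains` line at all. Separately, since your log shows Scrapy retrying `file:///robots.txt`, the robots.txt middleware is enabled; for a local crawl you would likely also want `ROBOTSTXT_OBEY = False` in settings.py (or pass `-s ROBOTSTXT_OBEY=False` on the command line) so Scrapy stops looking for a robots.txt that does not exist.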
Answered By - Japes