Issue
I am trying to get Scrapy to scrape a local file rather than a website over HTTPS. I get some errors related to the robots.txt file:
2020-07-13 23:58:43 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET file:///robots.txt> (failed 3 times): [Errno 2] No such file or directory: '/robots.txt'
2020-07-13 23:58:43 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET file:///robots.txt>: [Errno 2] No such file or directory: '/robots.txt'
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 15, in download_request
    with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: '/robots.txt'
I found a post about a similar problem on Stack Overflow:
How to crawl local HTML file with Scrapy
But the solution in that post says the problem is caused by the allowed_domains variable, and I don't define that variable at all in my spider:
import os
import sys

import scrapy


class TestSpider(scrapy.Spider):
    name = "test_schedule"
    season_flag = False
    season_val = ""

    """
    I need to override the __init__() method of scrapy.Spider
    because I need to define some attributes/variables from run-time arguments
    """
    def __init__(self, *a, **kw):
        super(TestSpider, self).__init__(*a, **kw)
        self.season_flag = False
        self.debug_flag = False
        self.season_val = ""

        # Get some run-time arguments
        if hasattr(self, "season"):
            self.season_val = str(self.season)
            self.season_flag = True

    """
    Note: I never define an allowed_domains list
    anywhere in start_requests()
    """
    def start_requests(self):
        schedule_filename = "/home/foo.html"

        # I check to see that the file 'foo.html' exists.
        # The file exists, but I still get a "robots.txt not found" error.
        if not os.path.exists(schedule_filename):
            stmt = "test file doesn't exist"
            self.log(stmt)
            sys.exit(1)
        else:
            stmt = " *** test file exists ***"
            self.log(stmt)

        url_list = [
            "file:///home/foo.html",
        ]

        for url in url_list:
            yield scrapy.Request(url=url,
                                 callback=self.parse_schedule_page)

    """
    Method that will parse the response from
    the scrapy.Request call.
    """
    def parse_schedule_page(self, response):
        game_elements_list = response.xpath("//table[@id = 'games']/tbody/tr")
        num_game_elements = len(game_elements_list)
        # etc., etc., etc. but the program flow doesn't even get here
Do I have to set some configuration option, or pass a run-time argument, to let Scrapy know that I'm pointing it at a local file? The post I referenced doesn't mention anything about this.
Scrapy's error message says that it cannot find robots.txt. Since I am using "file:///" instead of "https://", shouldn't Scrapy skip looking for a robots.txt file altogether?
Solution
The problem is caused by the RobotsTxtMiddleware trying to download a robots.txt file, which fails for a file:// URL. It can be solved by disabling the middleware. In your settings.py, set:
ROBOTSTXT_OBEY = False
This will cause the middleware to be disabled, as it raises a NotConfigured exception when that setting is false. (source)
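If you don't want to change the project-wide setting, the same override can be limited to a single spider through its custom_settings class attribute (a standard Scrapy feature). A minimal sketch, reusing the TestSpider from the question:
import scrapy


class TestSpider(scrapy.Spider):
    name = "test_schedule"

    # Disable robots.txt handling for this spider only, so no
    # <GET file:///robots.txt> request is ever issued.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }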
Another way to disable this middleware (or any other built-in middleware) is to set it to None in DOWNLOADER_MIDDLEWARES (inside settings.py), as mentioned in the docs:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}
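For a one-off run you can also override the setting from the command line with Scrapy's -s option instead of editing settings.py. For example (the season value is made up for illustration, passed through -a as in the question's __init__):
scrapy crawl test_schedule -s ROBOTSTXT_OBEY=False -a season=2020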
Answered By - renatodvc