Issue
I am trying to get Scrapy to scrape a local file rather than a website over HTTPS. I get some errors related to the robots.txt file:
2020-07-13 23:58:43 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET file:///robots.txt> (failed 3 times): [Errno 2] No such file or directory: '/robots.txt'
2020-07-13 23:58:43 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET file:///robots.txt>: [Errno 2] No such file or directory: '/robots.txt'
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 151, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib64/python2.7/site-packages/scrapy/core/downloader/handlers/file.py", line 15, in download_request
    with open(filepath, 'rb') as fo:
IOError: [Errno 2] No such file or directory: '/robots.txt'
I found a post about a similar problem on Stack Overflow:
How to crawl local HTML file with Scrapy
But the solution in that post says the problem is caused by the allowed_domains variable, and I don't define that variable at all in my spider:
import os
import sys

import scrapy


class TestSpider(scrapy.Spider):
    name = "test_schedule"
    season_flag = False
    season_val = ""

    """
    I need to override the __init__() method of scrapy.Spider
    because I need to define some attributes/variables from run-time arguments
    """
    def __init__(self, *a, **kw):
        super(TestSpider, self).__init__(*a, **kw)
        self.season_flag = False
        self.debug_flag = False
        self.season_val = ""

        # Get some run-time arguments
        if hasattr(self, "season"):
            self.season_val = str(self.season)
            self.season_flag = True

    """
    Note: I never define an allowed_domains list
    anywhere in start_requests()
    """
    def start_requests(self):
        schedule_filename = "/home/foo.html"

        # I check to see that the file 'foo.html' exists.
        # The file exists, but I still get a "robots.txt not found" error.
        if not os.path.exists(schedule_filename):
            stmt = "test file doesn't exist"
            self.log(stmt)
            sys.exit(1)
        else:
            stmt = " *** test file exists ***"
            self.log(stmt)

        url_list = [
            "file:///home/foo.html",
        ]

        for url in url_list:
            yield scrapy.Request(url=url,
                                 callback=self.parse_schedule_page)

    """
    Method that will parse the response from
    the scrapy.Request call.
    """
    def parse_schedule_page(self, response):
        game_elements_list = response.xpath("//table[@id = 'games']/tbody/tr")
        num_game_elements = len(game_elements_list)
        # etc., etc., etc. but the program flow doesn't even get here
Do I have to set some configuration option, or pass a run-time argument, to let Scrapy know that I'm pointing it at a local file? The post I referenced doesn't mention anything about this.
Scrapy's error message says that it cannot find robots.txt. Since I am using "file:///" instead of "https://", shouldn't Scrapy skip looking for a robots.txt file altogether?
Solution
The problem is caused by the RobotsTxtMiddleware trying to download a robots.txt file, which fails for a file:// URL. It can be solved by disabling the middleware. In your settings.py, set:
ROBOTSTXT_OBEY = False
This will cause the middleware to be disabled, as it raises a NotConfigured exception when that setting is false. (source)
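If you don't want to change the project-wide setting, the same override can be limited to a single spider through its custom_settings class attribute (a standard Scrapy feature). A minimal sketch, reusing the TestSpider from the question:
import scrapy


class TestSpider(scrapy.Spider):
    name = "test_schedule"

    # Disable robots.txt handling for this spider only, so no
    # <GET file:///robots.txt> request is ever issued.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }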
Another way to disable this middleware (or any other built-in middleware) is to set it to None in DOWNLOADER_MIDDLEWARES (inside settings.py), as mentioned in the docs:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}
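For a one-off run you can also override the setting from the command line with Scrapy's -s option instead of editing settings.py. For example (the season value is made up for illustration, passed through -a as in the question's __init__):
scrapy crawl test_schedule -s ROBOTSTXT_OBEY=False -a season=2020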
Answered By - renatodvc