Issue
I discovered yesterday that Scrapy respects the robots.txt file by default (ROBOTSTXT_OBEY = True).
If I request a URL with scrapy shell url and get a response, does that mean the URL is not protected by robots.txt?
Solution
According to the docs, ROBOTSTXT_OBEY is enabled by default only in projects created with the scrapy startproject command; otherwise it defaults to False.
https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#topics-dlmw-robots
To answer your question: yes, the scrapy shell command does respect the robots.txt configuration defined in settings.py. If ROBOTSTXT_OBEY = True, running scrapy shell against a URL disallowed by robots.txt will leave the response as None.
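To see why a given URL would be blocked, you can check it against robots.txt rules yourself with Python's standard-library urllib.robotparser, which is the same kind of check Scrapy's middleware performs. This is a minimal sketch; the robots.txt content and the example.com URLs here are made-up assumptions, not any site's real rules:

```python
from urllib import robotparser

# Hypothetical robots.txt content (assumption for illustration,
# not the real file served by any particular site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /browse
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# URLs under /browse are disallowed for every user agent,
# everything else is allowed.
print(rp.can_fetch("Scrapy", "https://example.com/browse"))  # False
print(rp.can_fetch("Scrapy", "https://example.com/"))        # True
```

In a real check you would call rp.set_url(...) with the site's robots.txt URL and rp.read() instead of parsing a hard-coded string.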
You can also test it by passing the robots.txt setting on the command line:
scrapy shell https://www.netflix.com --set="ROBOTSTXT_OBEY=True"
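Equivalently, the setting can be enabled project-wide in the settings.py that scrapy startproject generates; a minimal sketch of the relevant line:

```python
# settings.py (generated by scrapy startproject)
# With this enabled, Scrapy's RobotsTxtMiddleware fetches each
# site's robots.txt and drops requests to disallowed URLs.
ROBOTSTXT_OBEY = True
```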
Answered By - Marcos