Issue
I would like to scrape a website. However, I want to make sense of its robots.txt before I do. The lines that I don't understand are:
User-agent: *
Disallow: /*/*/*/*/*/*/*/*/
Disallow: /*?&*&*
Disallow: /*?*&*
Disallow: /*|*
Does the User-agent line mean access is OK anywhere? But then there is the Disallow line, which is the main one I am concerned about. Does it mean don't access 8 layers deep, or don't access at all?
Solution
I believe one can simply interpret the robots.txt rules the way you would a regex: the star (*) is a wildcard that can usually be read as "anything/everything".
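To make that concrete, here is a minimal sketch (just an illustration, not an official parser; the helper name rule_to_regex is made up) that translates each Disallow rule from the question into a regular expression:

import re

def rule_to_regex(rule: str) -> str:
    """Translate a robots.txt wildcard rule into a regex string:
    '*' matches any run of characters, everything else is literal."""
    return "".join(".*" if ch == "*" else re.escape(ch) for ch in rule)

for rule in ["/*/*/*/*/*/*/*/*/", "/*?&*&*", "/*?*&*", "/*|*"]:
    regex = rule_to_regex(rule)
    print(f"{rule:20} -> {regex}")
    # A URL is hit by a rule when its path *starts* with a match, e.g.
    # re.match(rule_to_regex("/*/*/*/*/*/*/*/*/"), "/a/b/c/d/e/f/g/h/") succeeds.

(The wildcard syntax also supports $ to anchor the end of a URL, but none of the rules above use it.)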
The line User-agent: * does not mean you are allowed to scrape everything; it simply means that the rules which follow apply to every user agent. Here are two example User-Agent strings:
# Chrome Browser
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
# Python requests default
python-requests/2.19.1
Both of these clients must comply with exactly the same set of rules.
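As a quick sanity check, Python's built-in urllib.robotparser can demonstrate this. Note that the standard-library parser follows the original robots.txt specification and, as far as I know, does plain prefix matching without expanding * inside paths, so the sketch below uses a simple made-up rule and a made-up URL rather than the wildcard rules from the question:

from urllib.robotparser import RobotFileParser

# A made-up robots.txt with a single literal rule, purely to illustrate
# that a "User-agent: *" group binds every client.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

chrome_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36")
requests_ua = "python-requests/2.19.1"

print(rp.can_fetch(chrome_ua, "/private/secret.html"))    # False
print(rp.can_fetch(requests_ua, "/private/secret.html"))  # False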
As for the specific rules in the question: for example, the line
Disallow: /*?*&*
means you are not allowed to scrape URLs whose query string contains more than one parameter, i.e. anything of the form
/some_path?first_param=first_value&second_param=second_value
(a ? followed somewhere by an &). Likewise, the line
Disallow: /*/*/*/*/*/*/*/*/
means URLs nested eight or more directory levels deep, of the form
/a/b/c/d/e/f/g/h/
are not allowed to be scraped. So it forbids access at that depth and below, not access to the site as a whole.
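If you want a parser that actually understands these wildcard rules, the Protego library (the robots.txt parser used by Scrapy, installable with pip install protego) handles them. Here is a small sketch using the rules from the question and some made-up URLs on a placeholder domain:

from protego import Protego  # pip install protego

# The rules from the question
robots_txt = """
User-agent: *
Disallow: /*/*/*/*/*/*/*/*/
Disallow: /*?&*&*
Disallow: /*?*&*
Disallow: /*|*
"""

rp = Protego.parse(robots_txt)

# example.com is only a placeholder; the URLs are invented for illustration
urls = [
    "https://example.com/a/b/c/",               # shallow path -> allowed
    "https://example.com/a/b/c/d/e/f/g/h/",     # eight levels deep -> disallowed
    "https://example.com/search?q=foo",         # one query parameter -> allowed
    "https://example.com/search?q=foo&page=2",  # two query parameters -> disallowed
]

for url in urls:
    print(rp.can_fetch(url, "python-requests/2.19.1"), url)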
Finally, here are insightful examples and more on the topic.
Answered By - niko