Issue
Imagine I am crawling foo.com. foo.com has several internal links to itself, and it also has some external links, for example:
foo.com/hello
foo.com/contact
bar.com
holla.com
I would like Scrapy to crawl all the internal links, but only go one level deep for external links: I want Scrapy to visit bar.com or holla.com, but I don't want it to follow any further links within bar.com, so only a depth of one.
Is this possible? What would the config for this case be?
Thanks.
Solution
You can base your spider on the CrawlSpider class and use Rules with a process_links method that you pass to each Rule. That method will filter unwanted links before they get followed. From the documentation:
process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.
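A minimal sketch of how this could look (not the answerer's exact code): it assumes foo.com is the internal domain, and the names FooSpider, parse_item, parse_external and filter_external_links are hypothetical. Internal links are followed freely, while external links are visited once with follow=False and filtered through a process_links method, as suggested above.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FooSpider(CrawlSpider):
    name = "foo"
    # Note: allowed_domains is deliberately not set, so the offsite
    # middleware does not drop the requests to external domains.
    start_urls = ["https://foo.com"]

    rules = (
        # Internal links: extract only foo.com URLs and keep following them.
        Rule(
            LinkExtractor(allow_domains=["foo.com"]),
            callback="parse_item",
            follow=True,
        ),
        # Remaining (external) links: visit the page once; follow=False
        # stops the crawl there, giving a depth of one. process_links
        # filters the extracted links before they are requested.
        Rule(
            LinkExtractor(),
            callback="parse_external",
            follow=False,
            process_links="filter_external_links",
        ),
    )

    def filter_external_links(self, links):
        # Keep only links pointing outside foo.com; internal links are
        # already handled by the first rule. Adjust the test as needed.
        return [link for link in links if "foo.com" not in link.url]

    def parse_item(self, response):
        yield {"internal_url": response.url}

    def parse_external(self, response):
        yield {"external_url": response.url}

You could run such a spider with scrapy crawl foo inside a project, or save it to a file and use scrapy runspider; the exact filtering logic in filter_external_links is just an illustration and would likely need tightening for real domains.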
Answered By - Tomáš Linhart