Issue
I am doing a broad crawl. I need to process a few pages per website in order to set values for one of about 20 classification rules. For example, one classification rule is "Has Phone Number" (runs a regex to see if there is a phone number in the page source and returns a boolean). The rules are implemented in a function called parse_page().
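To make the "Has Phone Number" rule concrete, here is a minimal sketch of what such a check might look like. The function name `has_phone_number` and the pattern `PHONE_RE` are assumptions for illustration, not the asker's actual code, and the regex only covers North American style numbers.

```python
import re

# Hypothetical pattern: loosely matches numbers shaped like "(555) 123-4567"
# or "555-123-4567". A real broad crawl would need a wider pattern.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def has_phone_number(page_source: str) -> bool:
    """Return True if the page source contains something shaped like a phone number."""
    return PHONE_RE.search(page_source) is not None
```

Inside parse_page(), a rule like this would presumably be called with the page source (for a Scrapy text response, `response.text`).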
I need the CrawlSpider to run parse_page() on the homepage of each crawled website, as well as on other common pages such as the about page, contact page, privacy policy page, etc.
When I run the spider, it starts with some-site.com and grabs the pages according to the Rule definitions in the code below.
The problem is that parse_page() currently runs only on pages such as some-site.com/contact-us and some-site.com/about-us, and never on the homepage of some-site.com. How do I specify a Rule() that includes the homepage, so that parse_page() is called for it as well as for the pages already covered?
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class SomeBotSpider(scrapy.spiders.CrawlSpider):
    name = 'some_bot'
    allowed_domains = ['some-site.com']
    start_urls = ['https://some-site.com/']

    # Each rule sends links whose URL matches the pattern to parse_page().
    rules = (
        Rule(LinkExtractor(allow='/contact'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclaimer'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclosure'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='/about'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='privacy'), callback='parse_page', follow=True),
    )
Solution
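You need to override the spider's parse_start_url method. CrawlSpider sends the responses for start_urls (here, the homepage) to parse_start_url rather than to a Rule callback; the rules are applied only to links extracted from a response. Calling parse_page from inside parse_start_url therefore runs it on the homepage too.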
Something like this:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class SomeBotSpider(scrapy.spiders.CrawlSpider):
    name = 'some_bot'
    allowed_domains = ['some-site.com']
    start_urls = ['https://some-site.com/']

    rules = (
        Rule(LinkExtractor(allow='/contact'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclaimer'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclosure'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='/about'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='privacy'), callback='parse_page', follow=True),
    )

    # parse_page() itself is the method from the question and is not shown here.
    def parse_start_url(self, response):
        # Responses for start_urls arrive here, so hand the homepage to parse_page().
        return self.parse_page(response)
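Since parse_page() is not shown in the question, the following is only a rough sketch of how a shared callback might collect several rule results into one item per page, and how parse_start_url can reuse it. The names `ClassifierMixin`, `RULES`, `PHONE_RE`, `EMAIL_RE` and the "has_email" rule are hypothetical, and a `SimpleNamespace` stands in for a Scrapy response so the callback can be tried without running a crawl.

```python
import re
from types import SimpleNamespace

# Hypothetical rules: each maps a rule name to a function of the page source.
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

RULES = {
    "has_phone_number": lambda text: PHONE_RE.search(text) is not None,
    "has_email": lambda text: EMAIL_RE.search(text) is not None,
}


class ClassifierMixin:
    """Hypothetical mixin holding the callback shared by all page types."""

    def parse_page(self, response):
        # One item per page: the URL plus a boolean for every rule.
        item = {"url": response.url}
        for name, rule in RULES.items():
            item[name] = rule(response.text)
        yield item

    def parse_start_url(self, response):
        # Same hook as in the answer: send the homepage through parse_page().
        return self.parse_page(response)


# Stand-in for a Scrapy response, only for trying the callback offline.
fake_home = SimpleNamespace(url="https://some-site.com/", text="Call (555) 123-4567")
items = list(ClassifierMixin().parse_start_url(fake_home))
```

In a real spider the mixin would be listed before CrawlSpider in the base classes (for example `class SomeBotSpider(ClassifierMixin, CrawlSpider)`), so that its parse_start_url takes precedence over the default one.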
Answered By - asimhashmi