Saturday, July 30, 2022

[FIXED] Make CrawlSpider Process data on HomePage + Other Extracted Links

Issue

I am doing a broad crawl. I need to process a few pages per website in order to set values for one of about 20 classification rules. For example, one classification rule is "Has Phone Number" (runs a regex to see if there is a phone number in the page source and returns a boolean). The rules are implemented in a function called parse_page().
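
For illustration, a single rule such as "Has Phone Number" might look roughly like the sketch below. The function name has_phone_number and the deliberately loose, North American style pattern are assumptions made for this example; they are not the actual rules used in parse_page().

```python
import re

# Assumed pattern: optional country code 1, then 3-3-4 digits with common separators.
# Real phone formats vary widely, so treat this as a placeholder.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def has_phone_number(page_source: str) -> bool:
    """Return True if the page source contains something shaped like a phone number."""
    return PHONE_RE.search(page_source) is not None
```

Each of the roughly 20 rules could follow the same shape: take the page source, return a boolean.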

I need the CrawlSpider to run parse_page() on the homepage of each crawled website, as well as on other common pages such as the about page, contact page, privacy policy page, etc.

When I run the spider, it starts with some-site.com and grabs the pages according to the Rule definitions in the code below.

My problem is that I also need parse_page() to run on the homepage of some-site.com; currently it only runs on some-site.com/contact-us, some-site.com/about-us, etc. My question is: how do I specify a Rule() that includes the homepage of the site, so that parse_page() is called for the homepage as well as for the other pages already covered?

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SomeBotSpider(CrawlSpider):
    name = 'some_bot'
    allowed_domains = ['some-site.com']
    start_urls = ['https://some-site.com/']

    # parse_page() holds the classification rules; its body is omitted here.
    rules = (
        Rule(LinkExtractor(allow='/contact'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclaimer'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclosure'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='/about'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='privacy'), callback='parse_page', follow=True),
    )

Solution

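You need to override the spider's parse_start_url method. CrawlSpider calls it for the responses to the URLs in start_urls (here, the homepage), and its default implementation returns nothing. The rules only apply to links extracted from crawled pages, so the homepage itself never reaches parse_page. You can call parse_page inside parse_start_url, something like this: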

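from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SomeBotSpider(CrawlSpider):
    name = 'some_bot'
    allowed_domains = ['some-site.com']
    start_urls = ['https://some-site.com/']

    # Rules are applied to links found in crawled pages, not to the start URL itself.
    rules = (
        Rule(LinkExtractor(allow='/contact'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclaimer'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='disclosure'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='/about'), callback='parse_page', follow=True),
        Rule(LinkExtractor(allow='privacy'), callback='parse_page', follow=True),
    )

    def parse_start_url(self, response):
        # Called for each start_urls response (the homepage); reuse the same callback.
        # parse_page() is the method from the question and is not shown here.
        return self.parse_page(response)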
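
Since parse_page itself is not shown, here is a minimal sketch of how the delegation could behave end to end. Everything in it is an assumption for illustration: the PageRulesMixin name, the single has_phone_number rule with its loose pattern, and the SimpleNamespace object standing in for a Scrapy response (only .url and .text are used).

```python
import re
from types import SimpleNamespace

# Assumed, deliberately loose pattern; a placeholder for a real rule.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


class PageRulesMixin:
    """Hypothetical mixin holding the shared callback and the start-URL hook."""

    def parse_page(self, response):
        # Runs for every page matched by a Rule; yields one item per page.
        yield {
            "url": response.url,
            "has_phone_number": bool(PHONE_RE.search(response.text)),
        }

    def parse_start_url(self, response):
        # Hand the start URL response to the same callback the rules use.
        return self.parse_page(response)


# Stand-in for a Scrapy response, so the sketch runs without a crawl.
fake_home = SimpleNamespace(
    url="https://some-site.com/",
    text="<p>Call (555) 123-4567</p>",
)
items = list(PageRulesMixin().parse_start_url(fake_home))
```

In a real spider the two methods would live directly on SomeBotSpider, or the mixin would be listed before CrawlSpider in the class bases so that its parse_start_url takes precedence.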

Answered By - asimhashmi

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
