Issue
I'm stuck trying to get my spider to work. This is the scenario: I'm trying to find all the URLs of a specific domain that are contained in a particular target website. For this, I've defined a couple of rules so I can crawl the site and pick out the links I'm interested in.
The thing is that it doesn't seem to work, even though I know there are links with the proper format on the site.
This is my spider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DEPTH_LIMIT': 4
    }

    rules = (
        Rule(LinkExtractor(unique=True, allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
So, in summary, I'm trying to find all the links from 'a2zinc.net' contained inside https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/ and its subsections.
As you can see, there are at least three occurrences of the desired links on the target site.
The funny thing is that when I test the spider against another target site (like this one) that also contains links of interest, it works as expected, and I can't really see the difference.
Also, if I create a LinkExtractor instance inside a parsing method (as in the snippet below), it is also able to find the desired links, but I don't think that's the best way to use CrawlSpider + Rules.
    def parse_item(self, response):
        le = LinkExtractor(allow_domains='a2zinc.net')
        links = le.extract_links(response)
        for link in links:
            yield {'link': link.url}
Any idea what the cause of the problem could be?
Thanks a lot.
Solution
Your code works. The only issue is that you have set the logging level to INFO, while the links that are being extracted return status code 403, which is only visible at the DEBUG level. Comment out your custom settings and you will see that the links are being extracted.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']
    custom_settings = {
        # 'LOG_LEVEL': 'INFO',
        # 'DEPTH_LIMIT': 4
    }

    rules = (
        Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        print(response.request.url)
        yield {'link': response.request.url}
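Note that Scrapy's HttpError spider middleware drops those 403 responses before they reach parse_item, so at DEBUG level you will see the requests being crawled and filtered, but you still won't get items for them. If you also want to record the a2zinc.net URLs as items despite the 403, one option is to let the spider handle that status code itself. This is only a minimal sketch of that idea, not part of the original answer:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class sp(CrawlSpider):
    name = 'sp'
    start_urls = ['https://nationalpavementexpo.com/show/floor-plan-exhibitor-list/']

    # Let 403 responses through to the callback instead of having
    # HttpErrorMiddleware drop them (HTTPERROR_ALLOWED_CODES = [403]
    # in the settings would have the same effect).
    handle_httpstatus_list = [403]

    rules = (
        Rule(LinkExtractor(allow_domains='a2zinc.net'), callback='parse_item'),
        Rule(LinkExtractor(unique=True, canonicalize=True, allow_domains='nationalpavementexpo.com'))
    )

    def parse_item(self, response):
        # The requested URL is available even when the server answers 403.
        yield {'link': response.request.url, 'status': response.status}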
Answered By - msenior_