Issue
I have a CrawlSpider that takes multiple URLs as start_urls, and several Rules for them. One of the Rules handles pagination:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'spdr'
    start_urls = [
        i.strip() for i in open('list.txt', 'r').read().splitlines() if i and i.strip()
    ]
    rules = [
        Rule(
            LinkExtractor(
                allow=[],
                restrict_css=[
                    "a[rel='next']",
                    "a[href*='nav_top-next']",
                ],
            ),
            follow=True,
        ),
        ...
The problem is that one of the websites redirects the spider back to the initial URL (https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell?0) when it tries to paginate, so the paginated request is dupefiltered and not crawled:
2023-01-13 21:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell?0> from <GET https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell;jsessionid=F45A1D3CC4EF7F8D27A606FD4521685B?0-1.-list-nav_top-next>
So I need to avoid filtering duplicates for one specific domain. Is that possible? Please advise.
Solution
You can use the process_request parameter of your Rule to set a function that checks each request for the specific URL; if it matches, you can set the request's dont_filter attribute to True.

For example:
def check_url(request, response):
    # If the request targets the URL we never want filtered, mark it so the
    # dupefilter lets it through.
    if request.url == MySpider.url_you_dont_want_to_filter:
        request.dont_filter = True
    return request
class MySpider(CrawlSpider):
    name = 'spdr'
    url_you_dont_want_to_filter = 'https://www.example.com/dont_filter_me'  # added this
    start_urls = [
        i.strip() for i in open('list.txt', 'r').read().splitlines() if i and i.strip()
    ]
    rules = [
        Rule(
            LinkExtractor(
                allow=[],
                restrict_css=[
                    "a[rel='next']",
                    "a[href*='nav_top-next']",
                ],
            ),
            follow=True,
            process_request=check_url,  # added this
        ),
        ...
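
If you want to exempt a whole domain rather than one exact URL (which is what the question asks for), a small variation of check_url could compare the request's hostname instead. This is only a sketch, assuming the redirecting site is piwi.wiesbaden.de from the question; swap in whichever domain you need:

from urllib.parse import urlparse

# Assumption: the domain whose requests should bypass the dupefilter
DOMAIN_YOU_DONT_WANT_TO_FILTER = 'piwi.wiesbaden.de'

def check_url(request, response):
    # Exempt every URL on that domain from duplicate filtering, not just one exact URL
    if urlparse(request.url).netloc == DOMAIN_YOU_DONT_WANT_TO_FILTER:
        request.dont_filter = True
    return request

Since Scrapy's redirect middleware builds the redirected request from the original one, the dont_filter flag should carry over, so the redirected-to URL is not dropped by the dupefilter either.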
Answered By - Alexander