Issue
I have a CrawlSpider that takes multiple URLs as start_urls, and several Rules for them. One of the Rules handles pagination:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'spdr'
    start_urls = [
        i.strip() for i in open('list.txt', 'r').read().splitlines() if i and i.strip()
    ]
    rules = [
        Rule(
            LinkExtractor(
                allow=[],
                restrict_css=[
                    "a[rel='next']",
                    "a[href*='nav_top-next']",
                ],
            ),
            follow=True,
        ),
        ...
The problem is that one of the websites redirects the spider back to the initial URL (https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell?0) when it tries to paginate, so the paginated request is dupefiltered and not crawled:
2023-01-13 21:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell?0> from <GET https://piwi.wiesbaden.de/gremium/detail/1/mitgliederaktuell;jsessionid=F45A1D3CC4EF7F8D27A606FD4521685B?0-1.-list-nav_top-next>
So I need to avoid filtering duplicates for one specific domain. Is that possible? Please advise.
Solution
You can use the process_request parameter of your Rule to set a function that checks each request for the specific URL; if it matches, you can set the request's dont_filter attribute to True.

For example:
def check_url(request, response):
    # If the request targets the URL we never want filtered, mark it so the
    # dupefilter lets it through.
    if request.url == MySpider.url_you_dont_want_to_filter:
        request.dont_filter = True
    return request
class MySpider(CrawlSpider):
    name = 'spdr'
    url_you_dont_want_to_filter = 'https://www.example.com/dont_filter_me'  # added this
    start_urls = [
        i.strip() for i in open('list.txt', 'r').read().splitlines() if i and i.strip()
    ]
    rules = [
        Rule(
            LinkExtractor(
                allow=[],
                restrict_css=[
                    "a[rel='next']",
                    "a[href*='nav_top-next']",
                ],
            ),
            follow=True,
            process_request=check_url,  # added this
        ),
        ...
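
If you want to exempt a whole domain rather than one exact URL (which is what the question asks for), a small variation of check_url could compare the request's hostname instead. This is only a sketch, assuming the redirecting site is piwi.wiesbaden.de from the question; swap in whichever domain you need:

from urllib.parse import urlparse

# Assumption: the domain whose requests should bypass the dupefilter
DOMAIN_YOU_DONT_WANT_TO_FILTER = 'piwi.wiesbaden.de'

def check_url(request, response):
    # Exempt every URL on that domain from duplicate filtering, not just one exact URL
    if urlparse(request.url).netloc == DOMAIN_YOU_DONT_WANT_TO_FILTER:
        request.dont_filter = True
    return request

Since Scrapy's redirect middleware builds the redirected request from the original one, the dont_filter flag should carry over, so the redirected-to URL is not dropped by the dupefilter either.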
Answered By - Alexander