Issue
I have a simple CrawlSpider that crawls the first page of a specific website. I want the spider to continue with ?p=1, ?p=2 and so on until it detects the end of the pagination. How can I do that?
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['some.at']
    start_urls = [
        'https://www.some.at',
    ]

    rules = (
        Rule(LinkExtractor(allow='traueranzeigen'), callback='parse_obi'),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
Solution
The reason your spider only scrapes the first page is that you have not added follow=True to your Rule definition, so the spider never follows the extracted links to discover further ones. You also need a second Rule that follows the next pages; this can be done with the restrict_css argument of LinkExtractor, pointing it at the class of the pagination div. See the sample code below.
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['bestattung-aichinger.at']
    start_urls = [
        'https://www.bestattung-aichinger.at',
    ]

    rules = (
        # Follow the link whose text contains "Traueranzeigen" to reach the obituary listing.
        Rule(LinkExtractor(restrict_text='Traueranzeigen'), callback='parse_obi', follow=True),
        # Follow the pagination links (?p=1, ?p=2, ...) found inside the page-number div.
        Rule(LinkExtractor(restrict_css='.seitenzahlen'), callback='parse_obi', follow=True),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
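If you would rather drive the ?p=N pagination yourself instead of relying on the pagination links, a plain Spider can increment the page number and stop when a page comes back empty, which makes the "end of site-iteration" condition explicit. The following is only a sketch under the assumption that the obituary listing lives under a /traueranzeigen path and accepts a p query parameter; the spider name, base_url and that path are hypothetical, while the CSS classes are taken from the snippets above.

import scrapy


class PomosPageSpider(scrapy.Spider):
    # Hypothetical variant that increments ?p=N itself instead of using CrawlSpider rules.
    name = 'obituaries_pages'
    allowed_domains = ['bestattung-aichinger.at']
    # Assumption: the obituary listing is reachable here and accepts a ?p=N parameter.
    base_url = 'https://www.bestattung-aichinger.at/traueranzeigen'

    def start_requests(self):
        yield scrapy.Request(f'{self.base_url}?p=1', callback=self.parse, cb_kwargs={'page': 1})

    def parse(self, response, page):
        entries = response.css('.homepage_unterseiten_layout_todesanzeigen a')
        if not entries:
            # Empty listing: assume we have gone past the last page, so stop here.
            return
        for entry in entries:
            yield {
                'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                'date': entry.css('.homepage_unterseiten_layout_datum::text').get(),
            }
        # Queue the next page; the crawl ends once a page returns no entries.
        yield scrapy.Request(f'{self.base_url}?p={page + 1}', callback=self.parse, cb_kwargs={'page': page + 1})

With the Rule-based approach above this is not necessary, since the pagination links are followed automatically, but the manual version gives you direct control over when the iteration stops.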
Answered By - msenior_