Issue
I have a simple CrawlSpider that crawls the first page of a specific website. I want the spider to continue with ?p=1, ?p=2 and so on until it detects the end of the pagination. How can I do that?
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['some.at']
    start_urls = [
        'https://www.some.at',
    ]

    rules = (
        Rule(LinkExtractor(allow='traueranzeigen'), callback='parse_obi'),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
Solution
The reason your spider only scrapes the first page is that you have not added follow=True to your Rule definition, so the spider never follows the extracted links to discover further ones. You also need a second Rule that follows the next pages; this can be done with the restrict_css argument of LinkExtractor, pointing it at the class of the pagination div. See the sample code below.
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['bestattung-aichinger.at']
    start_urls = [
        'https://www.bestattung-aichinger.at',
    ]

    rules = (
        # Follow the link whose text contains "Traueranzeigen" to reach the obituary listing.
        Rule(LinkExtractor(restrict_text='Traueranzeigen'), callback='parse_obi', follow=True),
        # Follow the pagination links (?p=1, ?p=2, ...) found inside the page-number div.
        Rule(LinkExtractor(restrict_css='.seitenzahlen'), callback='parse_obi', follow=True),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
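If you would rather drive the ?p=N pagination yourself instead of relying on the pagination links, a plain Spider can increment the page number and stop when a page comes back empty, which makes the "end of site-iteration" condition explicit. The following is only a sketch under the assumption that the obituary listing lives under a /traueranzeigen path and accepts a p query parameter; the spider name, base_url and that path are hypothetical, while the CSS classes are taken from the snippets above.

import scrapy


class PomosPageSpider(scrapy.Spider):
    # Hypothetical variant that increments ?p=N itself instead of using CrawlSpider rules.
    name = 'obituaries_pages'
    allowed_domains = ['bestattung-aichinger.at']
    # Assumption: the obituary listing is reachable here and accepts a ?p=N parameter.
    base_url = 'https://www.bestattung-aichinger.at/traueranzeigen'

    def start_requests(self):
        yield scrapy.Request(f'{self.base_url}?p=1', callback=self.parse, cb_kwargs={'page': 1})

    def parse(self, response, page):
        entries = response.css('.homepage_unterseiten_layout_todesanzeigen a')
        if not entries:
            # Empty listing: assume we have gone past the last page, so stop here.
            return
        for entry in entries:
            yield {
                'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                'date': entry.css('.homepage_unterseiten_layout_datum::text').get(),
            }
        # Queue the next page; the crawl ends once a page returns no entries.
        yield scrapy.Request(f'{self.base_url}?p={page + 1}', callback=self.parse, cb_kwargs={'page': page + 1})

With the Rule-based approach above this is not necessary, since the pagination links are followed automatically, but the manual version gives you direct control over when the iteration stops.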
Answered By - msenior_