Issue
I wrote the script below to scrape data from https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/. My goal is to follow all the product links and extract items from each of those pages, but the spider does not follow any links. With a basic Spider I can easily extract items from the start page, yet as a CrawlSpider it does nothing. It throws no error, only the following log output:
2022-02-19 21:36:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-19 21:36:56 [scrapy.core.engine] INFO: Spider opened
2022-02-19 21:36:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-19 21:36:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ProductSpider(CrawlSpider):
    name = 'product'
    allowed_domains = ['de.rs-online.com']
    start_urls = ['https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tr/td/div/div/div[2]/div/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1/text()").get(),
            'Categories': response.xpath("(//ol[@class='Breadcrumbstyled__StyledBreadcrumbList-sc-g4avu2-1 gHzygm']/li/a)[4]/text()").get(),
            'RS Best.-Nr.': response.xpath("//dl[@data-testid='key-details-desktop']/dd[1]/text()").get(),
            'URL': response.url
        }
Solution
If you want to follow all links without any filtering, you can simply omit the restrict_xpaths argument in your Rule definition. Note, however, that the XPath expressions in your parse_item callback do not match the page structure, so you will still receive empty items. Recheck your XPaths and correct them to obtain the information you are after.
rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)
Answered By - msenior_