Issue
I wrote the script below to scrape data from https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/. My goal is to follow all the product links and extract items from each of those pages, but the spider does not follow any links. With a basic Spider I can easily extract items from the start page, yet as a CrawlSpider it does nothing. It throws no error, only the following log output:
2022-02-19 21:36:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-19 21:36:56 [scrapy.core.engine] INFO: Spider opened
2022-02-19 21:36:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-19 21:36:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ProductSpider(CrawlSpider):
    name = 'product'
    allowed_domains = ['de.rs-online.com']
    start_urls = ['https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tr/td/div/div/div[2]/div/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1/text()").get(),
            'Categories': response.xpath("(//ol[@class='Breadcrumbstyled__StyledBreadcrumbList-sc-g4avu2-1 gHzygm']/li/a)[4]/text()").get(),
            'RS Best.-Nr.': response.xpath("//dl[@data-testid='key-details-desktop']/dd[1]/text()").get(),
            'URL': response.url
        }
Solution
If you want to follow all links without any filtering, you can simply omit the restrict_xpaths argument in your Rule definition. Note, however, that the XPath expressions in your parse_item callback do not match the page structure, so you will still receive empty items. Recheck your XPaths and correct them to obtain the information you are after.
rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)
Answered By - msenior_