Monday, June 20, 2022

[FIXED] How to deal with different rows in Xpath while crawling with Scrapy?

Tags: python, scrapy, web-crawler

Issue

I am trying to scrape a website's product links with Scrapy. I have already figured out how to get the links for all the sub-categories, but once I enter a page where the products are listed, I can't work out how to extract all the elements with XPath. The question is: how do you deal with different row numbers in XPath / Scrapy to get all the items?

Target page example: https://www.rimi.lt/e-parduotuve/lt/produktai/veganams-ir-vegetarams/c/SH-77

I am testing everything in the Scrapy shell first.

XPath to get the product card @href (obtained with the "Copy full XPath" option in Chrome):

response.xpath('/html/body/main/section/div/div/div/div/div/div/ul/li[1]/div/a/@href').extract()

The XPath for each following item has an incremented li[1] index. Example:

//*[@id="main"]/section/div[1]/div/div[2]/div[1]/div/div[2]/ul/li[3]/div/a
                                                                  ^
//*[@id="main"]/section/div[1]/div/div[2]/div[1]/div/div[2]/ul/li[2]/div/a
                                                                  ^

The function where I declare my XPaths, in the mySpider.py file:

    def __init__(self):
        self.declare_xpath()

    # All the XPaths the spider needs to know go here
    def declare_xpath(self):
        self.getAllCategoriesXpath = ""
        self.getAllSubCategoriesXpath = ""
        self.getAllItemsXpath = '/html/body/main/nav[1]/div/ul/li[1]/a/@href'
        self.TitleXpath = ""
        self.CategoryXpath = ""
        self.PriceXpath = ""
        self.FeaturesXpath = ""
        self.DescriptionXpath = ""
        self.SpecsXpath = ""

Solution

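When the XPaths for the rows differ only by a positional index such as li[3], drop the [x] from that step. A step without an index matches every sibling element of that name, so li selects all the li children of the ul instead of just one.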

Example:

//*[@id="main"]/section/div[1]/div/div[2]/div[1]/div/div[2]/ul/li[3]/div/a/@href
                                                              ^^^^^^

How to get all the elements:

//*[@id="main"]/section/div[1]/div/div[2]/div[1]/div/div[2]/ul/li/div/a/@href
                                                              ^^^
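
The effect of dropping the index can be tried offline with a minimal sketch. It uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath, so the href is read with .get() rather than /@href). The markup and the /p/... links are made up for illustration and are not taken from the target page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the product list.
MARKUP = """
<main id="main">
  <ul>
    <li><div><a href="/p/1">One</a></div></li>
    <li><div><a href="/p/2">Two</a></div></li>
    <li><div><a href="/p/3">Three</a></div></li>
  </ul>
</main>
""".strip()

root = ET.fromstring(MARKUP)

# With a positional index, only the first <li> of the <ul> is matched.
first_only = [a.get("href") for a in root.findall("./ul/li[1]/div/a")]

# Without the index, every <li> child of the <ul> is matched.
all_links = [a.get("href") for a in root.findall("./ul/li/div/a")]

print(first_only)
print(all_links)
```

In the Scrapy shell the same idea should apply to the expression from the question: calling response.xpath(...) with the li step left unindexed and then .extract() (or .getall()) returns a list with one href per product card.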

Answered By - Upsice

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
