Saturday, May 14, 2022

[FIXED] How to select Scrapy's xpath one before last element of a list <li>?

python, scrapy, web-crawler, web-scraping

Issue

I am scraping an e-commerce website (example link: https://elektromarkt.lt/namu-apyvokos-prekes/virtuves-ir-stalo-reikmenys/keptuves). I have a problem with pagination: the page has no specific tag or attribute for the next-page button (at the bottom of the website), and I realised I am not getting all the data. How can I select the second-to-last <li> element using XPath? I first tried to work out the element's fixed position, but some product lists have only 1-3 pages, so a hard-coded index is invalid for them.

This is my parsing function:

def parse_items(self,response):
    for href in response.xpath(self.getAllItemsXpath):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url,callback=self.parse_main_item, dont_filter=True)
        
    nexter_page = response.xpath('/html/body/div[1]/div[2]/div[1]/div[6]/div[2]/div[2]/div/div[2]/div[3]/div/div/div[2]/div[3]/div[1]/ul/li[12]/a/@href').extract_first()
    if nexter_page is None:
        next_page = response.xpath('/html/body/div[1]/div[2]/div[1]/div[6]/div[2]/div[2]/div/div[2]/div[3]/div/div/div[2]/div[3]/div[1]/ul/li[10]/a/@href').extract_first()
        url = response.urljoin(next_page)
        yield scrapy.Request(url, callback=self.parse)
    else: 
        url = response.urljoin(nexter_page)
        yield scrapy.Request(url, callback=self.parse)
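
The hard-coded li[12] and li[10] positions above are what fail on short lists. XPath's last() function counts from the end instead, so li[last()-1] picks the second-to-last <li> whatever the list length. The sketch below is only an illustration: the markup is hypothetical (the real page's structure may differ), second_to_last_href is a made-up helper name, and it uses the standard library's ElementTree, which supports this predicate, so it runs without Scrapy or network access.

```python
import xml.etree.ElementTree as ET


def second_to_last_href(ul_markup):
    """Return the href of the link in the second-to-last <li>, or None."""
    ul = ET.fromstring(ul_markup)
    # last()-1 is relative to the end of the list, so its length does not matter.
    link = ul.find("li[last()-1]/a")
    return link.get("href") if link is not None else None


# Hypothetical pagination lists: page links, then "next" and "last" buttons.
short_list = (
    '<ul>'
    '<li><a href="?page=1">1</a></li>'
    '<li><a href="?page=2">2</a></li>'
    '<li><a href="?page=2">next</a></li>'
    '<li><a href="?page=2">last</a></li>'
    '</ul>'
)
long_list = (
    '<ul>'
    + ''.join(f'<li><a href="?page={i}">{i}</a></li>' for i in range(1, 11))
    + '<li><a href="?page=4">next</a></li>'
    + '<li><a href="?page=10">last</a></li>'
    + '</ul>'
)

print(second_to_last_href(short_list))
print(second_to_last_href(long_list))
```

In a spider, the same predicate could be passed to response.xpath, for example response.xpath('//ul/li[last()-1]/a/@href').get(), where //ul is a placeholder for whatever actually locates the pagination list on the page.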

Solution

The page number appears in the browser's URL (the ?page= query parameter) and changes as you move through the pages, so instead of locating the next-page link you can build the pagination directly in start_urls with a for loop.

import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    # One start URL per page; range(1, 3) yields pages 1 and 2.
    start_urls = [
        'https://elektromarkt.lt/namu-apyvokos-prekes/virtuves-ir-stalo-reikmenys/keptuves?page=' + str(x)
        for x in range(1, 3)
    ]

    def parse(self, response):
        print(response.url)


if __name__ == "__main__":
    process = CrawlerProcess()
    # crawl() must be given the spider class to run.
    process.crawl(TestSpider)
    process.start()

Output:

https://elektromarkt.lt/namu-apyvokos-prekes/virtuves-ir-stalo-reikmenys/keptuves?page=1
https://elektromarkt.lt/namu-apyvokos-prekes/virtuves-ir-stalo-reikmenys/keptuves?page=2

 'downloader/response_status_count/200':
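
Since range(1, 3) fixes the number of pages in advance and the question notes that category lengths vary, one possible variation is to derive the next URL from the current one and stop once a page returns no items. The helper below is a minimal sketch of that idea using only the standard library; next_page_url is a hypothetical name, and treating a missing page parameter as page 1 is an assumption about the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="page"):
    """Return url with its page query parameter incremented by one.

    A URL without the parameter is treated as page 1 (an assumption).
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    # doseq=True writes each list value back as a normal key=value pair.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("https://example.com/keptuves?page=2"))
```

Inside parse, a spider could then yield scrapy.Request(next_page_url(response.url), callback=self.parse) only when the current page contained at least one product link, so the crawl ends on the first empty page.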

Answered By - F.Hoque

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
