Sunday, November 14, 2021

[FIXED] Scrapy: Following pagination link to scrape data

Labels: python, scrapy, web-scraping, xpath

Issue

I am trying to scrape data from a page and continue scraping following the pagination link.

The page I am trying to scrape is the one given in start_urls in the code below.

# -*- coding: utf-8 -*-
import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

def parse(self, response):
    for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
        item = {
            'product_name': products.xpath('.//h2/a/@title').extract_first(),
            'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
            'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
            'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
            'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
            'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
            #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
         }
        yield item

    #Follow the pagination link
    next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(url=next_page_url, callback=self.parse)

Problem

The code is not able to follow the pagination link.

How can you help

Modify the code to follow the pagination link.

Solution

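To get your code working, you need to fix the broken pagination link. The href taken from the rel="next" link is presumably not an absolute URL, and scrapy.Request() only accepts absolute URLs, whereas response.follow() also accepts relative ones and resolves them against the URL of the current page. Passing the href through response.urljoin() before building the request would work as well. Try the below approach.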

import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
            item = {
                'product_name': products.xpath('.//h2/a/@title').extract_first(),
                'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
            }
            yield item

        # Follow the pagination link
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

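Your pasted code was badly indented: parse() sits at module level instead of inside the AlibabaSpider class, so the spider has no parse callback of its own. I've fixed that as well.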
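
To see why the choice of call matters, here is a minimal sketch that uses only the standard library's urljoin(), which is what Scrapy's response.urljoin() is built on. The three href values are hypothetical; the post does not show what the rel="next" link on the page actually contains. None of them carries a scheme, which is the kind of URL scrapy.Request() rejects, yet all of them resolve to an absolute URL once joined with the URL of the page they came from.

```python
from urllib.parse import urljoin, urlparse

# URL of the page being parsed (taken from start_urls in the spider).
page_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1"

# Hypothetical href values; the real value on the page is not shown in the post.
candidate_hrefs = [
    "//www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2",  # protocol-relative
    "/catalog/agricultural-growing-media_cid144?page=2",  # root-relative
    "?page=2",  # query string only
]


def has_scheme(url):
    """Return True if the URL carries an explicit scheme such as https."""
    return bool(urlparse(url).scheme)


# Resolve each href against the page URL, as a relative link would be followed.
resolved = [urljoin(page_url, href) for href in candidate_hrefs]

for href, absolute in zip(candidate_hrefs, resolved):
    print(f"{href!r}: has scheme = {has_scheme(href)}, resolved = {absolute}")
```

In the spider itself you would not call urljoin() by hand. response.follow() does the joining for you, and response.urljoin(next_page_url) gives you the absolute URL if you prefer to keep scrapy.Request().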

Answered By - SIM

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
