Issue
I am trying to scrape data from a page and continue scraping following the pagination link.
The page I am trying to scrape is the catalog page listed in start_urls below.
# -*- coding: utf-8 -*-
import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
            item = {
                'product_name': products.xpath('.//h2/a/@title').extract_first(),
                'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
            }
            yield item

        # Follow the pagination link
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)
Problem
- The code is not able to follow the pagination link.
How can you help
- Modify the code to follow the pagination link.
Solution
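To get your code working, you need to fix the broken link. The href in the
<link rel="next"> tag most likely has no scheme (it is relative or
protocol-relative), and scrapy.Request only accepts absolute URLs; anything
else raises "ValueError: Missing scheme in request url". response.follow()
resolves the href against the current page URL before building the request.
Passing response.urljoin(next_page_url) to scrapy.Request works too.
Try the approach below.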
import scrapy


class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1']

    def parse(self, response):
        for products in response.xpath('//div[contains(@class, "m-gallery-product-item-wrap")]'):
            item = {
                'product_name': products.xpath('.//h2/a/@title').extract_first(),
                'price': products.xpath('.//div[@class="price"]/b/text()').extract_first('').strip(),
                'min_order': products.xpath('.//div[@class="min-order"]/b/text()').extract_first(),
                'company_name': products.xpath('.//div[@class="stitle util-ellipsis"]/a/@title').extract_first(),
                'prod_detail_link': products.xpath('.//div[@class="item-img-inner"]/a/@href').extract_first(),
                'response_rate': products.xpath('.//i[@class="ui2-icon ui2-icon-skip"]/text()').extract_first('').strip(),
                #'image_url': products.xpath('.//div[@class=""]/').extract_first(),
            }
            yield item

        # Follow the pagination link; response.follow() accepts relative URLs
        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)
Your pasted code was badly indented. I've fixed that as well.
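If you want to see why the original request failed, here is a minimal sketch using only the standard library. It assumes the href looks like one of the sample values below; these are hypothetical, since the actual markup of the page is not shown here. The helper names resolve_next and has_scheme are made up for illustration. urljoin performs the same kind of resolution that response.follow() and response.urljoin() rely on.

```python
from urllib.parse import urljoin, urlparse

# URL of the page currently being parsed (taken from start_urls).
page_url = "https://www.alibaba.com/catalog/agricultural-growing-media_cid144?page=1"

# Hypothetical href values; the real attribute on the page may differ.
protocol_relative = "//www.alibaba.com/catalog/agricultural-growing-media_cid144?page=2"
query_only = "?page=2"


def has_scheme(url):
    """Return True if the URL carries a scheme such as http or https."""
    return bool(urlparse(url).scheme)


def resolve_next(base_url, href):
    """Resolve a possibly relative href against the URL of the current page."""
    return urljoin(base_url, href)


for href in (protocol_relative, query_only):
    absolute = resolve_next(page_url, href)
    print(href, "has scheme:", has_scheme(href))
    print(absolute, "has scheme:", has_scheme(absolute))
```

Neither sample href has a scheme on its own, which is the condition scrapy.Request rejects. After resolution both point at an absolute https URL that can be requested.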
Answered By - SIM