Issue
I am new to scrapy and I've come across a complicated case.
I have to make 3 GET requests in order to build Product items:

product_url
category_url
stock_url

First, I need a request to product_url and a request to category_url to fill out the fields of the Product items. I then need to refer to stock_url's response to determine whether to save or discard the created items.
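For reference, the item has four fields (name, id, img_url, link_url); a stripped-down ProductItem along these lines is what the snippets below assume:

import scrapy

class ProductItem(scrapy.Item):
    # filled from the product page
    name = scrapy.Field()
    id = scrapy.Field()
    # filled from the category page
    img_url = scrapy.Field()
    link_url = scrapy.Field()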
Here's what I'm doing right now:
In my spider,
def start_requests(self):
    product_url = 'https://www.someurl.com/product?'
    item = ProductItem()
    yield scrapy.Request(product_url, self.parse_products, meta={'item': item})

def parse_products(self, response):
    # fill out 1/2 of the fields of ProductItem
    item = response.meta['item']
    item['name'] = response.xpath(...)
    item['id'] = response.xpath(...)

    category_url = 'https://www.someurl.com/category?'
    yield scrapy.Request(category_url, self.parse_products2, meta={'item': item})

def parse_products2(self, response):
    # fill out the rest of the fields of ProductItem
    item = response.meta['item']
    item['img_url'] = response.xpath(...)
    item['link_url'] = response.xpath(...)

    stock_url = 'https://www.someurl.com/stock?'
    yield scrapy.Request(stock_url, self.parse_final, meta={'item': item})

def parse_final(self, response):
    item = response.meta['item']
    for prod in response:  # iterate over the products in the stock response
        if prod.id == item['id'] and not prod.in_stock:
            pass  # drop item
Question: I was told before to handle the item-dropping logic in a pipeline. But whether I drop an item or not depends on making yet another GET request. Should I still move this logic to the pipelines, and is that possible without inheriting from scrapy.Spider?
Solution
Moving the item dropping logic to the pipeline is probably the best design.
You can use the (undocumented) Scrapy engine API to download requests from within a pipeline. Here's an example assuming the stock info for all items can be fetched from a single URL:
import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks


class StockPipeline(object):
    @inlineCallbacks
    def open_spider(self, spider):
        # stock_url is the single URL that returns stock info for all items
        req = scrapy.Request(stock_url)
        response = yield spider.crawler.engine.download(req, spider)

        # extract the stock info from the response
        self.stock_info = response.text

    def process_item(self, item, spider):
        # check if the item should be dropped
        if item['id'] not in self.stock_info:
            raise DropItem
        return item
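The pipeline also has to be enabled in settings.py for it to run at all; the module path below is just a placeholder for wherever the class actually lives:

ITEM_PIPELINES = {
    'myproject.pipelines.StockPipeline': 300,
}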
If there is a separate, per-item URL for stock info, you'd simply do the downloading in process_item() instead.
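A rough sketch of that variant, using the same engine API, might look like the following. The per-item stock URL and the in-stock check are assumptions made up for the example, not something from the original question:

import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks, returnValue


class PerItemStockPipeline(object):
    @inlineCallbacks
    def process_item(self, item, spider):
        # hypothetical per-item stock endpoint, keyed by the item's id
        stock_url = 'https://www.someurl.com/stock?id=%s' % item['id']
        response = yield spider.crawler.engine.download(
            scrapy.Request(stock_url), spider)

        # drop the item if the (assumed) stock response marks it out of stock
        if 'in_stock' not in response.text:
            raise DropItem('out of stock: %s' % item['id'])
        returnValue(item)

Scrapy allows process_item() to return a Deferred, so the inlineCallbacks approach works here as well; just keep in mind this adds one extra request per scraped item.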
Answered By - stranac