Issue
I am new to scrapy and I've come across a complicated case.
I have to make 3 GET requests in order to build Product items:

product_url
category_url
stock_url

First, I need a request to product_url and a request to category_url to fill out the fields of the Product items. I then need to refer to stock_url's response to determine whether to save or discard the created items.
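For reference, the item has four fields (name, id, img_url, link_url); a stripped-down ProductItem along these lines is what the snippets below assume:

import scrapy

class ProductItem(scrapy.Item):
    # filled from the product page
    name = scrapy.Field()
    id = scrapy.Field()
    # filled from the category page
    img_url = scrapy.Field()
    link_url = scrapy.Field()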
Here's what I'm doing right now:
In my spider,
def start_requests(self):
    product_url = 'https://www.someurl.com/product?'
    item = ProductItem()
    yield scrapy.Request(product_url, self.parse_products, meta={'item': item})

def parse_products(self, response):
    # fill out 1/2 of the fields of ProductItem
    item = response.meta['item']
    item['name'] = response.xpath(...)
    item['id'] = response.xpath(...)

    category_url = 'https://www.someurl.com/category?'
    yield scrapy.Request(category_url, self.parse_products2, meta={'item': item})

def parse_products2(self, response):
    # fill out the rest of the fields of ProductItem
    item = response.meta['item']
    item['img_url'] = response.xpath(...)
    item['link_url'] = response.xpath(...)

    stock_url = 'https://www.someurl.com/stock?'
    yield scrapy.Request(stock_url, self.parse_final, meta={'item': item})

def parse_final(self, response):
    item = response.meta['item']
    for prod in response:  # iterate over the products in the stock response
        if prod.id == item['id'] and not prod.in_stock:
            pass  # drop item
Question: I was told before to handle the item-dropping logic in a pipeline. But whether I drop an item or not depends on making yet another GET request. Should I still move this logic to the pipelines, and is that possible without inheriting from scrapy.Spider?
Solution
Moving the item dropping logic to the pipeline is probably the best design.
You can use the (undocumented) Scrapy engine API to download requests from within a pipeline. Here's an example assuming the stock info for all items can be fetched from a single URL:
import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks


class StockPipeline(object):
    @inlineCallbacks
    def open_spider(self, spider):
        # stock_url is the single URL that returns stock info for all items
        req = scrapy.Request(stock_url)
        response = yield spider.crawler.engine.download(req, spider)

        # extract the stock info from the response
        self.stock_info = response.text

    def process_item(self, item, spider):
        # check if the item should be dropped
        if item['id'] not in self.stock_info:
            raise DropItem
        return item
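The pipeline also has to be enabled in settings.py for it to run at all; the module path below is just a placeholder for wherever the class actually lives:

ITEM_PIPELINES = {
    'myproject.pipelines.StockPipeline': 300,
}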
If there is a separate, per-item URL for stock info, you'd simply do the downloading in process_item() instead.
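A rough sketch of that variant, using the same engine API, might look like the following. The per-item stock URL and the in-stock check are assumptions made up for the example, not something from the original question:

import scrapy
from scrapy.exceptions import DropItem
from twisted.internet.defer import inlineCallbacks, returnValue


class PerItemStockPipeline(object):
    @inlineCallbacks
    def process_item(self, item, spider):
        # hypothetical per-item stock endpoint, keyed by the item's id
        stock_url = 'https://www.someurl.com/stock?id=%s' % item['id']
        response = yield spider.crawler.engine.download(
            scrapy.Request(stock_url), spider)

        # drop the item if the (assumed) stock response marks it out of stock
        if 'in_stock' not in response.text:
            raise DropItem('out of stock: %s' % item['id'])
        returnValue(item)

Scrapy allows process_item() to return a Deferred, so the inlineCallbacks approach works here as well; just keep in mind this adds one extra request per scraped item.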
Answered By - stranac