Issue
I have a web scraper built on the Scrapy framework that scrapes product data. A pipeline is set up that is supposed to filter out duplicate SKUs/products, but after a full run I can see that it is not filtering them out: the output contained many duplicates.
Here are the pipelines I have set up:
ITEM_PIPELINES = {
    "pure_scraper.pipelines.DuplicateSkuPipeline": 300,
    "pure_scraper.pipelines.FieldValidatorPipeline": 400,
    "scrapy.pipelines.images.ImagesPipeline": 500,
}
The first item pipeline filters duplicate SKUs, the second makes sure that the item has the required fields, and the third downloads product images.
Here is the code to my Duplicate Sku Pipeline:
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicateSkuPipeline:
    def __init__(self):
        # SKUs seen so far during this crawl
        self.skus = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        sku = adapter.get("sku")
        if sku in self.skus:
            raise DropItem(f"Duplicate sku found: {sku}")
        else:
            self.skus.add(sku)
            return item

    def close_spider(self, spider):
        spider.logger.info(f"Found {len(self.skus)} unique skus")
I have stepped through specific products in debug mode, and the pipeline seems to work when I run one product at a time. The duplicates only occur when I run the scraper end-to-end. Should I be storing these SKUs in a database table instead of a set within the class?
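One way to check the set-based logic in isolation, without running a crawl, is to feed a few items through the same membership test. The sketch below is only a stand-in for the pipeline above, not the project's actual code: DropItem is redefined locally so the snippet runs without Scrapy installed, and the sample items and the run_pipeline helper are hypothetical.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption: no Scrapy here)."""


class DuplicateSkuPipeline:
    def __init__(self):
        self.skus = set()

    def process_item(self, item, spider=None):
        sku = item.get("sku")
        if sku in self.skus:
            raise DropItem(f"Duplicate sku found: {sku}")
        self.skus.add(sku)
        return item


def run_pipeline(pipeline, items):
    """Hypothetical helper: split items into those kept and those dropped."""
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropItem:
            dropped.append(item)
    return kept, dropped


# Made-up sample data: one exact repeat and one SKU differing only in case.
items = [{"sku": "A1"}, {"sku": "B2"}, {"sku": "A1"}, {"sku": "a1"}]
kept, dropped = run_pipeline(DuplicateSkuPipeline(), items)
print("kept:", [i["sku"] for i in kept])
print("dropped:", [i["sku"] for i in dropped])
```

In this sketch the exact repeat is dropped, while "a1" is kept because the set compares strings exactly. If the real SKUs vary in case or surrounding whitespace, they would pass a check like this as distinct values.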
Solution
In the Scrapy documentation, under "Enabling your Media Pipeline", it says:
To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.
For Images Pipeline, use:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
So I changed the ITEM_PIPELINES setting in settings.py, making ImagesPipeline the first pipeline:
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "pure_scraper.pipelines.FieldValidatorPipeline": 300,
    "pure_scraper.pipelines.DuplicateSkuPipeline": 400,
}
This fixed my issue. If you are having issues with your pipelines and you use the ImagesPipeline, make sure its priority is set to 1.
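Scrapy passes each item through the enabled pipelines in ascending order of the integer assigned to each one, so changing these numbers changes the sequence in which an item visits them. As a rough illustration, the plain-Python sketch below sorts the original and the changed settings the same way. The execution_order helper is hypothetical and not part of Scrapy.

```python
def execution_order(item_pipelines):
    """Hypothetical helper: pipeline paths sorted by priority, lowest first."""
    return [path for path, _ in sorted(item_pipelines.items(), key=lambda kv: kv[1])]


# Settings from the question.
before = {
    "pure_scraper.pipelines.DuplicateSkuPipeline": 300,
    "pure_scraper.pipelines.FieldValidatorPipeline": 400,
    "scrapy.pipelines.images.ImagesPipeline": 500,
}

# Settings after the change described in this answer.
after = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "pure_scraper.pipelines.FieldValidatorPipeline": 300,
    "pure_scraper.pipelines.DuplicateSkuPipeline": 400,
}

for label, setting in (("before", before), ("after", after)):
    print(label)
    for path in execution_order(setting):
        print("  ", setting[path], path)
```

Note that the change also moves the field validator ahead of the duplicate filter, so items now reach the duplicate check last rather than first.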
Answered By - Matthew G