Friday, December 29, 2023

[FIXED] Scrapy Pipeline Is Not Filtering Duplicate Items

Labels: python, scrapy, web-scraping

Issue

I have a web scraper built on the Scrapy framework that scrapes product data. There is a pipeline set up that is supposed to filter out duplicate skus/products, but after a full run I can see that it is not filtering them out: the output contained many duplicates.
Here are the pipelines I have set up:

ITEM_PIPELINES = {
    "pure_scraper.pipelines.DuplicateSkuPipeline": 300,
    "pure_scraper.pipelines.FieldValidatorPipeline": 400,
    "scrapy.pipelines.images.ImagesPipeline": 500,
}

The first item pipeline filters duplicate skus, the second makes sure that the item has the required fields, and the third downloads product images.
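Scrapy runs the enabled item pipelines in ascending order of the integer values in ITEM_PIPELINES, so lower values run first. As a minimal sketch in plain Python (the helper name pipeline_order is hypothetical and not part of Scrapy), the order implied by the setting above can be listed like this:

```python
# Settings copied from the question above.
ITEM_PIPELINES = {
    "pure_scraper.pipelines.DuplicateSkuPipeline": 300,
    "pure_scraper.pipelines.FieldValidatorPipeline": 400,
    "scrapy.pipelines.images.ImagesPipeline": 500,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by priority value, lowest first.

    Hypothetical helper for illustration only; it mirrors the documented
    rule that pipelines with lower values run earlier.
    """
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for position, path in enumerate(pipeline_order(ITEM_PIPELINES), start=1):
        print(position, path)
```

With these values the duplicate filter is listed first and the images pipeline last, which matches the description above.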
Here is the code for my DuplicateSkuPipeline:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicateSkuPipeline:
    def __init__(self):
        # skus seen so far during this crawl
        self.skus = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        sku = adapter.get("sku")
        if sku in self.skus:
            # Dropping the item stops it from reaching later pipelines
            raise DropItem(f"Duplicate sku found: {sku}")
        self.skus.add(sku)
        return item

    def close_spider(self, spider):
        spider.logger.info(f"Found {len(self.skus)} unique skus")
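One way to check the set-based logic in isolation is to feed several items through a single pipeline instance outside of a crawl. The following is only a sketch under stated assumptions: DropItem here is a stand-in for scrapy.exceptions.DropItem, items are plain dicts rather than ItemAdapter-wrapped items, the helper run_items is hypothetical, and the sample skus are made up. It does not reproduce whatever happens in the full run.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DuplicateSkuPipeline:
    """Same set-based logic as above, using plain dicts instead of ItemAdapter."""

    def __init__(self):
        self.skus = set()

    def process_item(self, item, spider=None):
        sku = item.get("sku")
        if sku in self.skus:
            raise DropItem(f"Duplicate sku found: {sku}")
        self.skus.add(sku)
        return item


def run_items(pipeline, items):
    """Feed items through one pipeline instance; return (kept, dropped) lists."""
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropItem:
            dropped.append(item)
    return kept, dropped


if __name__ == "__main__":
    sample = [{"sku": "A1"}, {"sku": "B2"}, {"sku": "A1"}]  # made-up skus
    kept, dropped = run_items(DuplicateSkuPipeline(), sample)
    print(len(kept), "kept;", len(dropped), "dropped")
```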

I have stepped through specific products in debug mode, and the pipeline seems to work when I run one product at a time. The duplicates only appear when I run the scraper end-to-end. Should I be storing these skus in a database table instead of a set within the class?
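On the closing question: an in-memory set should already deduplicate within a single run in one process, so a database table mainly helps when skus must be remembered across runs or shared between processes. Below is a minimal sketch of that alternative using Python's standard sqlite3 module. The class SqliteSkuStore, the table seen_skus, and the method add_if_new are hypothetical names chosen for illustration, and the solution further down did not need anything like this.

```python
import sqlite3


class SqliteSkuStore:
    """Hypothetical helper that remembers skus in a SQLite table."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps the sketch self-contained; a file path would persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen_skus (sku TEXT PRIMARY KEY)"
        )

    def add_if_new(self, sku):
        """Insert sku; return True if it was new, False if already stored."""
        cursor = self.conn.execute(
            "INSERT OR IGNORE INTO seen_skus (sku) VALUES (?)", (sku,)
        )
        self.conn.commit()
        # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
        return cursor.rowcount == 1

    def close(self):
        self.conn.close()


if __name__ == "__main__":
    store = SqliteSkuStore()
    for sku in ["A1", "B2", "A1"]:  # made-up skus
        print(sku, "new" if store.add_if_new(sku) else "duplicate")
    store.close()
```

In a pipeline, process_item could call add_if_new and raise DropItem when it returns False, with the connection opened in open_spider and closed in close_spider.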

Solution

In the Scrapy documentation, under "Enabling your Media Pipeline", it says:

To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.

For Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

So I changed the ITEM_PIPELINES setting in settings.py, making ImagesPipeline the first pipeline (note that this also places FieldValidatorPipeline ahead of DuplicateSkuPipeline):

ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
    "pure_scraper.pipelines.FieldValidatorPipeline": 300,
    "pure_scraper.pipelines.DuplicateSkuPipeline": 400
}

This fixed my issue. If you are having similar issues with your pipelines and you use the ImagesPipeline, try setting its value in ITEM_PIPELINES to 1.

Answered By - Matthew G

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
