Issue
I have the following spider:
    import scrapy
    from scrapy.utils.project import get_project_settings

    class WebSpider(scrapy.Spider):
        name = "web"
        allowed_domains = ["www.web.com"]
        start_urls = ["https://www.web.com/page/"]

        custom_settings = {
            "ITEM_PIPELINES": {
                "models.pipelines.ModelsPipeline": 1,
                "models.pipelines.MongoDBPipeline": 2,
            },
            "IMAGES_STORE": get_project_settings().get("FILES_STORE"),
        }

        def parse_models(self, response):
            ...
            yield WebItem(image_urls=[img_url], images=[name], name=name, collection="web")
    class WebItem(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()
        name = scrapy.Field()
        collection = scrapy.Field()
The MongoDBPipeline always works with either of the following configurations:
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    }

    "ITEM_PIPELINES": {
        "models.pipelines.MongoDBPipeline": 2,
    }
but the ModelsPipeline never runs with either of the following configurations:
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    }

    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
    }
The ModelsPipeline is in the same file as MongoDBPipeline, and its code is the following:
    import pdb

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class ModelsPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            pdb.set_trace()
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            pdb.set_trace()
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            adapter = ItemAdapter(item)
            adapter['image_paths'] = image_paths
            return item
but neither get_media_requests nor item_completed ever executes.
The code is the same as in the docs: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
What is wrong, and why doesn't Scrapy run the ModelsPipeline?
EDIT
Scrapy version is 2.8.0
Thanks.
Solution
When using the media pipelines, you need to have all of the appropriate settings populated in order for them to work.
In your case the ModelsPipeline inherits from the ImagesPipeline, so you must fulfill all of the ImagesPipeline requirements.
Those include:
- The IMAGES_STORE setting ... not FILES_STORE
- The scrapy item needs to have the appropriate fields.
- You can point the pipeline at custom field names using:

    IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
    IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'

- Or you can use the default field names, or both:

    import scrapy

    class MyItem(scrapy.Item):
        # ... other item fields ...
        image_urls = scrapy.Field()
        images = scrapy.Field()
There are several other optional settings that need to be set properly if you choose to use them, and you must make sure that your IMAGES_STORE path already exists.
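In the spider above, the likely culprit is that IMAGES_STORE is set to get_project_settings().get("FILES_STORE"): if FILES_STORE is not defined in the project settings, IMAGES_STORE ends up as None, and Scrapy drops the pipeline at startup (the ImagesPipeline raises NotConfigured when its store URI is empty), which matches the silent behaviour described in the question. A minimal sketch of the fix, using a hypothetical path, is to give IMAGES_STORE a concrete value:

```python
# Hypothetical fix: give IMAGES_STORE a concrete, existing directory instead
# of copying the (possibly unset) FILES_STORE value. The path is an example
# only and must be valid on your machine.
IMAGES_STORE = "/tmp/web_images"

custom_settings = {
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    },
    # A concrete string here keeps the ImagesPipeline enabled; a None value
    # would make Scrapy discard the pipeline with NotConfigured.
    "IMAGES_STORE": IMAGES_STORE,
}
```

A quick sanity check is to look at the "Enabled item pipelines" line that Scrapy logs at startup: if ModelsPipeline is missing from that list, it was dropped before crawling began.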
Answered By - Alexander