Issue
I have the following spider:
    import scrapy
    from scrapy.utils.project import get_project_settings

    class WebSpider(scrapy.Spider):
        name = "web"
        allowed_domains = ["www.web.com"]
        start_urls = ["https://www.web.com/page/"]

        custom_settings = {
            "ITEM_PIPELINES": {
                "models.pipelines.ModelsPipeline": 1,
                "models.pipelines.MongoDBPipeline": 2,
            },
            "IMAGES_STORE": get_project_settings().get("FILES_STORE"),
        }

        def parse_models(self, response):
            ...
            yield WebItem(image_urls=[img_url], images=[name], name=name, collection="web")
    class WebItem(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()
        name = scrapy.Field()
        collection = scrapy.Field()
The MongoDBPipeline always works with either of the following configurations:
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    }

    "ITEM_PIPELINES": {
        "models.pipelines.MongoDBPipeline": 2,
    }
but the ModelsPipeline never runs with either of the following configurations:
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    }

    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
    }
The ModelsPipeline is in the same file as MongoDBPipeline, and its code is the following:
    import pdb

    import scrapy
    from itemadapter import ItemAdapter
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class ModelsPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            pdb.set_trace()
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            pdb.set_trace()
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            adapter = ItemAdapter(item)
            adapter['image_paths'] = image_paths
            return item
but neither get_media_requests nor item_completed ever executes.
The code is the same as in the docs: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
What is wrong, and why doesn't Scrapy run the ModelsPipeline?
EDIT
Scrapy version is 2.8.0
Thanks.
Solution
When using the media pipelines, you need to have all of the appropriate settings populated in order for them to work.
In your case the ModelsPipeline inherits from the ImagesPipeline, so you must fulfill all of the ImagesPipeline requirements.
Those include:
- The IMAGES_STORE setting ... not FILES_STORE
- The scrapy item needs to have the appropriate fields.
- You can point the pipeline at custom field names using:

    IMAGES_URLS_FIELD = 'field_name_for_your_images_urls'
    IMAGES_RESULT_FIELD = 'field_name_for_your_processed_images'

- Or you can use the default field names, or both:

    import scrapy

    class MyItem(scrapy.Item):
        # ... other item fields ...
        image_urls = scrapy.Field()
        images = scrapy.Field()
There are several other optional settings that need to be set properly if you choose to use them, and you must make sure that your IMAGES_STORE path already exists.
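In the spider above, the likely culprit is that IMAGES_STORE is set to get_project_settings().get("FILES_STORE"): if FILES_STORE is not defined in the project settings, IMAGES_STORE ends up as None, and Scrapy drops the pipeline at startup (the ImagesPipeline raises NotConfigured when its store URI is empty), which matches the silent behaviour described in the question. A minimal sketch of the fix, using a hypothetical path, is to give IMAGES_STORE a concrete value:

```python
# Hypothetical fix: give IMAGES_STORE a concrete, existing directory instead
# of copying the (possibly unset) FILES_STORE value. The path is an example
# only and must be valid on your machine.
IMAGES_STORE = "/tmp/web_images"

custom_settings = {
    "ITEM_PIPELINES": {
        "models.pipelines.ModelsPipeline": 1,
        "models.pipelines.MongoDBPipeline": 2,
    },
    # A concrete string here keeps the ImagesPipeline enabled; a None value
    # would make Scrapy discard the pipeline with NotConfigured.
    "IMAGES_STORE": IMAGES_STORE,
}
```

A quick sanity check is to look at the "Enabled item pipelines" line that Scrapy logs at startup: if ModelsPipeline is missing from that list, it was dropped before crawling began.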
Answered By - Alexander