Issue
Below is a simplified version of my code, which uses the Scrapy ImagesPipeline to download images from a site:
    import scrapy
    from scrapy_splash import SplashRequest
    from imageExtract.items import ImageextractItem


    class ExtractSpider(scrapy.Spider):
        name = 'extract'
        start_urls = ['url']

        def parse(self, response):
            image = ImageextractItem()
            titles = ['a', 'b', 'c', 'd', 'e', 'f']
            rel = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6']
            image['title'] = titles
            image['image_urls'] = rel
            return image
It all works fine, but with the default settings the pipeline avoids downloading duplicates. Is there any way to override this so that the duplicates are downloaded as well? Thanks.
Solution
I think one possible solution is to create your own image pipeline that inherits from scrapy.pipelines.images.ImagesPipeline and overrides the get_media_requests method (see the documentation for an example). When yielding each scrapy.Request, pass dont_filter=True to the constructor.
Answered By - Tomáš Linhart