Issue
As stated in the title, I want to replace the image file name with the item's name. Here is my example:
When I run my Scrapy spider, the downloaded files get the standard SHA1 hash as their name.
If possible, I would also appreciate it if it could grab only the first image instead of the whole group.
URL - https://www.antaira.com/products/10-100Mbps/LNX-500A
Expected image name - LNX-500A.jpg
Spider.py
from copyreg import clear_extension_cache
import scrapy
from ..items import AntairaItem
class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse_images(self, response):
        raw_image_urls = response.css('.image img ::attr(src)').getall()
        clean_image_urls = []
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield {
            'image_urls': clean_image_urls
        }
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import json
class AntairaPipeline:
    def process_item(self, item, spider):
        # calling dumps to create json data.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def open_spider(self, spider):
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        self.file.close()


class customImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        #item = request.meta['item']  # Like this you can use all from the item, not just url
        #image_guid = request.meta.get('filename', '')
        image_guid = request.url.split('/')[-1]
        #image_direct = request.meta.get('directoryname', '')
        return 'full/%s.jpg' % (image_guid)

    # Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + response.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #return [Request(x, meta={'filename': item['image_name']})
        #        for x in item.get(self.image_urls_field, [])]
        for image in item['images']:
            yield Request(image)
I understand there is a way to use the request meta data, but I would like the image to be named after the item's product name if possible. Thank you.
EDIT -
Original file name - 12f6537bd206cf58e86365ed6b7c1fb446c533b2.jpg
Required file name - "LNX_500A_01.jpg", using the last part of the start_url path with a number appended when there is more than one image; otherwise just "LNX_500A.jpg".
Solution
So it took a little tweaking, but this is what I got. I extract the item name and all of the image URLs with XPath expressions, then in the image pipeline I add the item name and a file number to each request's meta keyword argument. Those two are then joined together in the file_path method of the pipeline.
You could just as easily split the request URL and use that as the file name instead. Both approaches will do the trick.
Also, for some reason I wasn't getting any images at all with the CSS selector, so I switched it to an XPath expression. If the CSS works for you, then you can switch it back and it should still work.
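The URL-split alternative mentioned above can be sketched as a plain helper (the function name is mine, for illustration); inside file_path you would call it with request.url:

```python
from os.path import splitext
from urllib.parse import urlparse

def filename_from_url(url, default_ext=".jpg"):
    """Return the last path segment of *url* as an image file name,
    appending a default extension if the URL has none."""
    path = urlparse(url).path
    name = path.rstrip("/").split("/")[-1]
    base, ext = splitext(name)
    return base + (ext or default_ext)

# filename_from_url("https://www.antaira.com/products/10-100Mbps/LNX-500A")
# -> "LNX-500A.jpg"
```

The downside compared to the meta approach is that you are at the mercy of whatever the image URLs happen to look like, whereas the item name is under your control.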
spider file
import scrapy
from ..items import MyItem
class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse(self, response):
        item = MyItem()
        raw_image_urls = response.xpath('//div[@class="selectors"]/a/@href').getall()
        name = response.xpath("//h1[@class='product-name']/text()").get()
        filename = name.split(' ')[0].strip()
        urls = [response.urljoin(i) for i in raw_image_urls]
        item["name"] = filename
        item["image_urls"] = urls
        yield item
items.py
from scrapy import Item, Field
class MyItem(Item):
    name = Field()
    image_urls = Field()
    images = Field()
pipelines.py
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *args, item=None):
        filename = request.meta["filename"].strip()
        number = request.meta["file_num"]
        return filename + "_" + str(number) + ".jpg"

    def get_media_requests(self, item, info):
        name = item["name"]
        for i, url in enumerate(item["image_urls"]):
            meta = {"filename": name, "file_num": i}
            yield Request(url, meta=meta)
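As a quick sanity check, the naming logic in file_path boils down to a small pure function (the helper name here is just for illustration):

```python
def build_image_name(filename, file_num):
    # Mirrors the file_path() logic above: "<item name>_<index>.jpg".
    return filename.strip() + "_" + str(file_num) + ".jpg"

# For the LNX-500A item, the first two images come out as:
# build_image_name("LNX-500A", 0) -> "LNX-500A_0.jpg"
# build_image_name("LNX-500A", 1) -> "LNX-500A_1.jpg"
```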
settings.py
ITEM_PIPELINES = {
    'project.pipelines.ImagePipeline': 1,
}

IMAGES_STORE = 'image_dir'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'
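If you also want thumbnails (as the question's thumb_path override suggests), the stock images pipeline can generate them for you from the IMAGES_THUMBS setting, storing them under a thumbs/<size>/ subfolder of IMAGES_STORE. The sizes below are just examples:

```python
# settings.py (optional): emit scaled copies of each downloaded image.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```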
With all of this in place, running scrapy crawl ImageDownload
creates this directory:
Project
| - image_dir
| | - LNX-500A_0.jpg
| | - LNX-500A_1.jpg
| | - LNX-500A_2.jpg
| | - LNX-500A_3.jpg
| | - LNX-500A_4.jpg
|
| - project
| - __init__.py
| - items.py
| - middlewares.py
| - pipelines.py
| - settings.py
|
| - spiders
| - antaira.py
And these are the files that were created.
LNX-500A_0.jpg LNX-500A_1.jpg LNX-500A_2.jpg LNX-500A_3.jpg LNX-500A_4.jpg
Answered By - Alexander