Issue
As stated in the title, I want to replace the image file name with the item's name. Here is my example:
When I run my Scrapy spider, the downloaded files get the standard SHA1 hash as their name.
If possible, I would also appreciate it if it could grab only the first image instead of the whole group.
URL - https://www.antaira.com/products/10-100Mbps/LNX-500A
Expected image name - LNX-500A.jpg
Spider.py
from copyreg import clear_extension_cache
import scrapy
from ..items import AntairaItem
class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse_images(self, response):
        raw_image_urls = response.css('.image img ::attr(src)').getall()
        clean_image_urls = []
        for img_url in raw_image_urls:
            clean_image_urls.append(response.urljoin(img_url))
        yield {
            'image_urls': clean_image_urls
        }
pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import json
class AntairaPipeline:
    def process_item(self, item, spider):
        # calling dumps to create json data.
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def open_spider(self, spider):
        self.file = open('result.json', 'w')

    def close_spider(self, spider):
        self.file.close()


class customImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        #item = request.meta['item']  # Like this you can use all from the item, not just url
        #image_guid = request.meta.get('filename', '')
        image_guid = request.url.split('/')[-1]
        #image_direct = request.meta.get('directoryname', '')
        return 'full/%s.jpg' % (image_guid)

    # Name thumbnail version
    def thumb_path(self, request, thumb_id, response=None, info=None):
        image_guid = thumb_id + response.url.split('/')[-1]
        return 'thumbs/%s/%s.jpg' % (thumb_id, image_guid)

    def get_media_requests(self, item, info):
        #return [Request(x, meta={'filename': item['image_name']})
        #        for x in item.get(self.image_urls_field, [])]
        for image in item['images']:
            yield Request(image)
I understand there is a way to use the request meta data, but I would like the image to be named after the item's product name if possible. Thank you.
EDIT -
Original file name - 12f6537bd206cf58e86365ed6b7c1fb446c533b2.jpg
Required file name - "LNX_500A_01.jpg", using the last part of the start_url path with a number appended when there is more than one image; otherwise just "LNX_500A.jpg".
Solution
So it took a little tweaking, but this is what I got. I extract the item name and all of the image URLs with XPath expressions, then in the image pipeline I add the item name and a file number to each request's meta keyword argument. Those two are then joined together in the file_path method of the pipeline.
You could just as easily split the request URL and use that as the file name instead. Both approaches will do the trick.
Also, for some reason I wasn't getting any images at all with the CSS selector, so I switched it to an XPath expression. If the CSS works for you, then you can switch it back and it should still work.
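The URL-split alternative mentioned above can be sketched as a plain helper (the function name is mine, for illustration); inside file_path you would call it with request.url:

```python
from os.path import splitext
from urllib.parse import urlparse

def filename_from_url(url, default_ext=".jpg"):
    """Return the last path segment of *url* as an image file name,
    appending a default extension if the URL has none."""
    path = urlparse(url).path
    name = path.rstrip("/").split("/")[-1]
    base, ext = splitext(name)
    return base + (ext or default_ext)

# filename_from_url("https://www.antaira.com/products/10-100Mbps/LNX-500A")
# -> "LNX-500A.jpg"
```

The downside compared to the meta approach is that you are at the mercy of whatever the image URLs happen to look like, whereas the item name is under your control.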
spider file
import scrapy
from ..items import MyItem
class ImageDownload(scrapy.Spider):
    name = 'ImageDownload'
    allowed_domains = ['antaira.com']
    start_urls = [
        'https://www.antaira.com/products/10-100Mbps/LNX-500A',
    ]

    def parse(self, response):
        item = MyItem()
        raw_image_urls = response.xpath('//div[@class="selectors"]/a/@href').getall()
        name = response.xpath("//h1[@class='product-name']/text()").get()
        filename = name.split(' ')[0].strip()
        urls = [response.urljoin(i) for i in raw_image_urls]
        item["name"] = filename
        item["image_urls"] = urls
        yield item
items.py
from scrapy import Item, Field
class MyItem(Item):
    name = Field()
    image_urls = Field()
    images = Field()
pipelines.py
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *args, item=None):
        filename = request.meta["filename"].strip()
        number = request.meta["file_num"]
        return filename + "_" + str(number) + ".jpg"

    def get_media_requests(self, item, info):
        name = item["name"]
        for i, url in enumerate(item["image_urls"]):
            meta = {"filename": name, "file_num": i}
            yield Request(url, meta=meta)
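As a quick sanity check, the naming logic in file_path boils down to a small pure function (the helper name here is just for illustration):

```python
def build_image_name(filename, file_num):
    # Mirrors the file_path() logic above: "<item name>_<index>.jpg".
    return filename.strip() + "_" + str(file_num) + ".jpg"

# For the LNX-500A item, the first two images come out as:
# build_image_name("LNX-500A", 0) -> "LNX-500A_0.jpg"
# build_image_name("LNX-500A", 1) -> "LNX-500A_1.jpg"
```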
settings.py
ITEM_PIPELINES = {
    'project.pipelines.ImagePipeline': 1,
}

IMAGES_STORE = 'image_dir'
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'
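If you also want thumbnails (as the question's thumb_path override suggests), the stock images pipeline can generate them for you from the IMAGES_THUMBS setting, storing them under a thumbs/<size>/ subfolder of IMAGES_STORE. The sizes below are just examples:

```python
# settings.py (optional): emit scaled copies of each downloaded image.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```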
With all of this in place, running scrapy crawl ImageDownload
creates this directory:
Project
| - image_dir
| | - LNX-500A_0.jpg
| | - LNX-500A_1.jpg
| | - LNX-500A_2.jpg
| | - LNX-500A_3.jpg
| | - LNX-500A_4.jpg
|
| - project
| - __init__.py
| - items.py
| - middlewares.py
| - pipelines.py
| - settings.py
|
| - spiders
| - antaira.py
And these are the files that were created.
LNX-500A_0.jpg LNX-500A_1.jpg LNX-500A_2.jpg LNX-500A_3.jpg LNX-500A_4.jpg
Answered By - Alexander