Issue
I am trying to scrape a product page from ZARA, like this one: https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115
My scrapy-splash container is running. In the Scrapy shell, I fetch the page:
fetch('http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115')
2021-05-14 14:30:42 [scrapy.core.engine] INFO: Spider opened
2021-05-14 14:30:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://localhost:8050/render.html?url=https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115> (referer: None)
Everything works so far, and I am able to get the header and price. However, I also want to get the image URLs of the product.
I try to get them with:
response.css('img.media-image__image::attr(src)').getall()
But the result is this:
['https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png', 'https://static.zara.net/stdstatic/1.211.0-b.44/images/transparent-background.png']
These are all the same transparent placeholder image, not the real product images. The images display in the browser, and I can see them arriving in the network requests. Is this because they are loaded with AJAX requests? How do I solve this?
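A side note on the fetch call above: the target URL is passed to Splash unencoded, so the `&v2=1718115` part is read as a separate Splash argument rather than as part of the product URL. This is probably not the cause of the placeholder images, but it is easy to rule out. Below is a minimal sketch of building the render.html URL with the standard library. The helper name `splash_url` and the `wait=2` value are only illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SPLASH_RENDER = "http://localhost:8050/render.html"
target = (
    "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html"
    "?v1=108967877&v2=1718115"
)


def splash_url(page_url, **splash_args):
    # Hypothetical helper: urlencode percent-encodes the target URL, so its
    # own "?" and "&" stay inside the single `url` argument.
    return SPLASH_RENDER + "?" + urlencode({"url": page_url, **splash_args})


# Unencoded, as in the fetch call above: the query string is split at "&".
naive = parse_qs(urlsplit(SPLASH_RENDER + "?url=" + target).query)

# Encoded: Splash receives the full product URL; wait=2 is an arbitrary value.
encoded = parse_qs(urlsplit(splash_url(target, wait=2)).query)

print(naive["url"][0])    # product URL cut off before "&v2=..."
print(encoded["url"][0])  # full product URL
```

In the Scrapy shell the encoded form would be used as fetch(splash_url(target)).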
Solution
@samuelhogg deserves the credit for finding the JSON, but here is an example spider showing how to get all the image URLs from the page. Note that you don't even need Splash here. I have not tested it with Splash, but I think it should still work.
import json

from scrapy import Spider


class Zara(Spider):
    name = "zara"
    start_urls = [
        "https://www.zara.com/us/en/fitted-houndstooth-blazer-p07808160.html?v1=108967877&v2=1718115"
    ]

    def parse(self, response):
        # Find the json identified by @samuelhogg
        data = response.css("script[type='application/ld+json']::text").get()
        # Make a set of all the images in the json
        images = {image for i in json.loads(data) for image in i["image"]}
        # Do what you want with them!
        print(images)
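To see what the spider above relies on without hitting the site, here is a rough standalone sketch of the same idea using only the standard library. The sample HTML and example.com URLs are made up, and the names `JsonLdParser` and `extract_images` are hypothetical. It also assumes the JSON-LD may be either a single object or a list, and that `image` may be a string or a list of strings, which the real page may or may not do.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def extract_images(html):
    parser = JsonLdParser()
    parser.feed(html)
    images = set()
    for block in parser.blocks:
        data = json.loads(block)
        # Accept both a single JSON-LD object and a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            value = item.get("image", [])
            if isinstance(value, str):
                value = [value]
            if isinstance(value, list):
                images.update(v for v in value if isinstance(v, str))
    return images


# Made-up page: the <img> tag holds a placeholder, the JSON-LD holds real URLs.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
[{"@type": "Product", "image": ["https://example.com/a.jpg", "https://example.com/b.jpg"]},
 {"@type": "Product", "image": "https://example.com/a.jpg"}]
</script>
</head><body><img src="placeholder.png"></body></html>
"""

images = extract_images(SAMPLE_HTML)
print(sorted(images))
```

The point is the same as in the spider: the img tags only carry a placeholder until JavaScript swaps in the real src, while the JSON-LD block embedded in the HTML already lists the image URLs.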
Answered By - tomjn