Issue
I am trying to scrape the PlayStation web store: the title and game link from the main page, and the price for each game from a second page. However, when I use a callback to parse_page2, every returned item contains the title and item['link'] value of the most recent item (The Last of Us Remastered).
My code is below:
import scrapy
from scrapy import Request
# PlaystationItem is the project's own Item class (its import is not shown in the question).

class PsStoreSpider(scrapy.Spider):
    name = 'psstore'
    start_urls = ['https://store.playstation.com/en-ie/pages/browse']

    def parse(self, response):
        item = PlaystationItem()
        products = response.css('a.psw-link')
        for product in products:
            item['main_url'] = response.url
            item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
            item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            request = Request(link, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['price'] = response.css('span.psw-t-title-m::text').get()
        item['other_url'] = response.url
        yield item
And part of the output:
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/229261>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/229261',
'price': 'Free',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/232847>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/232847',
'price': '€59.99',
'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://store.playstation.com/en-ie/concept/224802> (referer: https://store.playstation.com/en-ie/pages/browse)
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/224802>
{'link': 'https://store.playstation.com/en-ie/concept/228638',
'main_url': 'https://store.playstation.com/en-ie/pages/browse',
'other_url': 'https://store.playstation.com/en-ie/concept/224802',
'price': '€29.99',
'title': 'The Last of Us™ Remastered'}
As you can see, the price is returned correctly, but the title and link are taken from the last scraped object. What am I missing here?
Thanks
Solution
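The thing is that you create your item once, at the beginning of your parse method, and then update it over and over. That also means that you always pass the same item object to parse_page2. By the time those callbacks run, the loop has typically overwritten the title and link fields, so every result shows the values that were written last.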
If you were to create your item in the for-loop you would get a new one in every iteration and should get the expected result.
Like this:
def parse(self, response):
    products = response.css('a.psw-link')
    for product in products:
        # Create a fresh item per product so each request carries its own object.
        item = PlaystationItem()
        item['main_url'] = response.url
        item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
        item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
        request = Request(link, callback=self.parse_page2)
        request.meta['item'] = item
        yield request
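To see why the placement of the item matters, here is a minimal sketch in plain Python that needs no Scrapy at all. Ordinary dicts stand in for the item and for request.meta, and the function names (schedule_shared, schedule_fresh, run_callbacks) and game titles are made up for illustration. It assumes, as in the log output above, that the second-page callbacks run only after the loop has moved on.

```python
def schedule_shared(titles):
    """Buggy pattern: one item is created before the loop and reused."""
    item = {}
    pending = []
    for title in titles:
        item["title"] = title           # overwrites the same dict each time
        pending.append({"item": item})  # every "request" references that one dict
    return pending


def schedule_fresh(titles):
    """Fixed pattern: a new item is created in every iteration."""
    pending = []
    for title in titles:
        item = {"title": title}         # independent object per product
        pending.append({"item": item})
    return pending


def run_callbacks(pending):
    """Stand-in for the second-page callback, run after the loop has finished."""
    return [meta["item"]["title"] for meta in pending]


titles = ["Game A", "Game B", "Game C"]
shared_titles = run_callbacks(schedule_shared(titles))
fresh_titles = run_callbacks(schedule_fresh(titles))
print(shared_titles)  # ['Game C', 'Game C', 'Game C']
print(fresh_titles)   # ['Game A', 'Game B', 'Game C']
```

The shared version reproduces the symptom from the question: every result carries the last title, because all of the stored references point at the same object.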
Answered By - Patrick Klein