Saturday, May 14, 2022

[FIXED] Scrapy Parse function not passing found values to the parse_page2 function

Issue

I am trying to scrape the PlayStation web store: the title and game link from the main page, and the price for each game from a second page. However, when I use a callback to parse_page2, every returned item contains the title and item['link'] value of the most recent item (The Last of Us Remastered).

My code is below:

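import scrapy
from scrapy import Request

# PlaystationItem is the project's own Item class; its import is not shown in the question.

class PsStoreSpider(scrapy.Spider):
    name = 'psstore'
    start_urls = ['https://store.playstation.com/en-ie/pages/browse']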

    def parse(self, response):
        item = PlaystationItem()
        products = response.css('a.psw-link')
 
        for product in products:

            item['main_url'] = response.url
            item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
            item['link'] = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']

            request = Request(link, callback=self.parse_page2)
            request.meta['item'] = item
            yield request

    def parse_page2(self, response):
        item = response.meta['item']
        item['price'] = response.css('span.psw-t-title-m::text').get()
        item['other_url'] = response.url
        yield item

And part of the output:

2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/229261> 
{'link': 'https://store.playstation.com/en-ie/concept/228638',
 'main_url': 'https://store.playstation.com/en-ie/pages/browse',
 'other_url': 'https://store.playstation.com/en-ie/concept/229261',
 'price': 'Free',
 'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/232847> 
{'link': 'https://store.playstation.com/en-ie/concept/228638',
 'main_url': 'https://store.playstation.com/en-ie/pages/browse',
 'other_url': 'https://store.playstation.com/en-ie/concept/232847',
 'price': '€59.99',
 'title': 'The Last of Us™ Remastered'}
2022-05-09 19:54:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://store.playstation.com/en-ie/concept/224802> (referer: https://store.playstation.com/en-ie/pages/browse)
2022-05-09 19:54:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://store.playstation.com/en-ie/concept/224802> 
{'link': 'https://store.playstation.com/en-ie/concept/228638',
 'main_url': 'https://store.playstation.com/en-ie/pages/browse',
 'other_url': 'https://store.playstation.com/en-ie/concept/224802',
 'price': '€29.99',
 'title': 'The Last of Us™ Remastered'}

As you can see, the price is returned correctly, but the title and link are taken from the last scraped object. What am I missing here?

Thanks

Solution

The problem is that you create your item once, at the beginning of your parse method, and then update that same object on every iteration. As a result, every request carries a reference to the same item, and by the time parse_page2 runs, that item holds the values from the last iteration.
If you create the item inside the for loop instead, you get a new one in every iteration and should get the expected result.
Like this:

    def parse(self, response):
        products = response.css('a.psw-link')
 
        for product in products:
            item = PlaystationItem()
            item['main_url'] = response.url
            item['title'] = product.css('span.psw-t-body.psw-c-t-1.psw-t-truncate-2.psw-m-b-2::text').get()
            # Build the absolute link once and reuse it for both the item and the request
            link = 'https://store.playstation.com' + product.css('a.psw-link.psw-content-link').attrib['href']
            item['link'] = link

            request = Request(link, callback=self.parse_page2)
            request.meta['item'] = item
            yield request
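
The same effect can be reproduced without Scrapy. The following is only a minimal sketch using plain dicts in place of PlaystationItem; the function names schedule_shared and schedule_fresh and the sample titles are hypothetical and not part of the original code. Draining the generator with list() stands in for the assumption that the callbacks run only after the loop in parse has finished, which is usually, though not necessarily always, the case in a real crawl.

```python
def schedule_shared(titles):
    """Mimic the buggy parse(): one item is created before the loop."""
    item = {}
    for title in titles:
        item['title'] = title
        yield item  # every yielded reference points at the same dict


def schedule_fresh(titles):
    """Mimic the fixed parse(): a new item is created in each iteration."""
    for title in titles:
        item = {}
        item['title'] = title
        yield item


titles = ['Game A', 'Game B', 'Game C']

# list() exhausts each generator before the items are inspected,
# standing in for callbacks that run after the loop has completed.
shared = list(schedule_shared(titles))
fresh = list(schedule_fresh(titles))

print([i['title'] for i in shared])  # ['Game C', 'Game C', 'Game C']
print([i['title'] for i in fresh])   # ['Game A', 'Game B', 'Game C']
```

In the shared version all three list entries are the very same object, so the last assignment wins everywhere; in the fresh version each entry keeps its own values.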

Answered By - Patrick Klein

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
