Issue
I am trying to scrape the following website: https://www.getwines.com/category_Wine
I am successful in scraping the first page, but I have trouble going to the next pages. There are two reasons for this:
- When inspecting the "next page" button, I don't get a relative or an absolute URL. Instead I get JavaScript:getPage(2), which I can't use to follow links.
- The next-page button link can be accessed via (//table[@class='tbl_pagination']//a//@href)[11] when on the first page, but from the 2nd page onwards it is the 12th item, i.e. (//table[@class='tbl_pagination']//a//@href)[12].
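For reference, the next-page link can also be located by its ">>" label instead of by position, which avoids the 11th/12th-item ambiguity; a minimal XPath sketch (the ">>" markup is taken from the accepted answer below):

next_href = response.xpath("//table[@class='tbl_pagination']//a[b[text()='>>']]/@href").get()
# still returns something like "JavaScript:getPage(2)", which has to be executed in the browser rather than followed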
So ultimately my question is: how do I effectively go to ALL the subsequent pages and scrape the data?
This is probably very simple to solve, but I am a beginner at web scraping, so any feedback is appreciated. Please see my code below.
Thanks for your help.
import scrapy
from scrapy_selenium import SeleniumRequest


class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                    product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }
        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"https://www.getwines.com/category_Wine"
Solution
Please see below the code that answers the above question.
In a nutshell, I changed the structure of the code and it now works perfectly. Some remarks:
- First, save the page source of every page in a list.
- It is important to catch NoSuchElementException at the end of the while/try loop. Before adding this, the code kept failing because it did not know what to do once the last page was reached.
- Finally, parse the content of the stored pages (responses).
All in all, I think structuring the code this way works well when integrating Selenium with Scrapy. However, as I am a beginner at web scraping, any additional feedback on how to make this integration more efficient will be appreciated.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException


class WinesSpider(scrapy.Spider):
    name = 'wines'
    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            callback=self.parse
        )

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        self.responses.append(initial_page)

        # Keep triggering the ">>" link and storing each page's source
        # until the link no longer exists (i.e. the last page is reached).
        found = True
        while found:
            try:
                next_page = driver.find_element_by_xpath("//b[text()= '>>']/parent::a")
                href = next_page.get_attribute('href')
                # href is "JavaScript:getPage(n)", so execute it instead of following it
                driver.execute_script(href)
                driver.implicitly_wait(2)
                self.responses.append(driver.page_source)
            except NoSuchElementException:
                break

        # Parse every stored page source with a Scrapy Selector
        for resp in self.responses:
            r = Selector(text=resp)
            products = r.xpath("(//div[@class='layMain']//tbody)[5]/tr")
            for product in products:
                yield {
                    'product_name':
                        product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                    'product_link':
                        product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                    'product_actual_price':
                        product.xpath(".//span[@class='RegularPrice']/text()").get(),
                    'product_price_onsale':
                        product.xpath(".//td//td[3]//td/span[4]/text()").get()
                }
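For completeness, scrapy-selenium also needs a driver configured in settings.py before SeleniumRequest will work; a minimal sketch based on the scrapy-selenium README, assuming a local Firefox/geckodriver setup (swap in Chrome/chromedriver if that is what you have installed):

# settings.py -- minimal scrapy-selenium configuration (the driver choice is an assumption)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser without a visible window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

With that in place, the spider can be run as usual, e.g. scrapy crawl wines -o wines.json.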
Answered By - sophocles