Issue
I am trying to scrape the following website: https://www.getwines.com/category_Wine
I am successful in scraping the first page, but I have trouble going to the next pages. There are two reasons for this:
- When inspecting the "next page" button, I don't get a relative or an absolute URL. Instead I get JavaScript:getPage(2), which I can't use to follow links.
- The next-page button link can be accessed via (//table[@class='tbl_pagination']//a//@href)[11] when on the first page, but from the 2nd page onwards it is the 12th item, i.e. (//table[@class='tbl_pagination']//a//@href)[12].
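For reference, the next-page link can also be located by its ">>" label instead of by position, which avoids the 11th/12th-item ambiguity; a minimal XPath sketch (the ">>" markup is taken from the accepted answer below):

next_href = response.xpath("//table[@class='tbl_pagination']//a[b[text()='>>']]/@href").get()
# still returns something like "JavaScript:getPage(2)", which has to be executed in the browser rather than followed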
So ultimately my question is: how do I effectively go to ALL the subsequent pages and scrape the data?
This is probably very simple to solve, but I am a beginner at web scraping, so any feedback is appreciated. Please see my code below.
Thanks for your help.
import scrapy
from scrapy_selenium import SeleniumRequest


class WinesSpider(scrapy.Spider):
    name = 'wines'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            wait_time=3,
            callback=self.parse
        )

    def parse(self, response):
        products = response.xpath("(//div[@class='layMain']//tbody)[5]/tr")
        for product in products:
            yield {
                'product_name':
                    product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                'product_link':
                    product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                'product_actual_price':
                    product.xpath(".//td//td[3]//td/span[2]/text()").get(),
                'product_price_onsale':
                    product.xpath(".//td//td[3]//td/span[4]/text()").get()
            }
        # next_page = response.xpath("(//table[@class='tbl_pagination']//a//@href)[11]").get()
        # if next_page:
        #     absolute_url = f"https://www.getwines.com/category_Wine"
Solution
Please see below the code that answers the above question.
In a nutshell, I changed the structure of the code and it now works perfectly. Some remarks:
- First, save the page source of every page in a list.
- It is important to catch NoSuchElementException at the end of the while/try loop. Before adding this, the code kept failing because it did not know what to do once the last page was reached.
- Finally, parse the content of the stored pages (responses).
All in all, I think structuring the code this way works well when integrating Selenium with Scrapy. However, as I am a beginner at web scraping, any additional feedback on how to make this integration more efficient will be appreciated.
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException


class WinesSpider(scrapy.Spider):
    name = 'wines'
    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.getwines.com/category_Wine',
            callback=self.parse
        )

    def parse(self, response):
        driver = response.meta['driver']
        initial_page = driver.page_source
        self.responses.append(initial_page)

        # Keep triggering the ">>" link and storing each page's source
        # until the link no longer exists (i.e. the last page is reached).
        found = True
        while found:
            try:
                next_page = driver.find_element_by_xpath("//b[text()= '>>']/parent::a")
                href = next_page.get_attribute('href')
                # href is "JavaScript:getPage(n)", so execute it instead of following it
                driver.execute_script(href)
                driver.implicitly_wait(2)
                self.responses.append(driver.page_source)
            except NoSuchElementException:
                break

        # Parse every stored page source with a Scrapy Selector
        for resp in self.responses:
            r = Selector(text=resp)
            products = r.xpath("(//div[@class='layMain']//tbody)[5]/tr")
            for product in products:
                yield {
                    'product_name':
                        product.xpath(".//a[@class='Srch-producttitle']/text()").get(),
                    'product_link':
                        product.xpath(".//a[@class='Srch-producttitle']/@href").get(),
                    'product_actual_price':
                        product.xpath(".//span[@class='RegularPrice']/text()").get(),
                    'product_price_onsale':
                        product.xpath(".//td//td[3]//td/span[4]/text()").get()
                }
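For completeness, scrapy-selenium also needs a driver configured in settings.py before SeleniumRequest will work; a minimal sketch based on the scrapy-selenium README, assuming a local Firefox/geckodriver setup (swap in Chrome/chromedriver if that is what you have installed):

# settings.py -- minimal scrapy-selenium configuration (the driver choice is an assumption)
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser without a visible window

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

With that in place, the spider can be run as usual, e.g. scrapy crawl wines -o wines.json.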
Answered By - sophocles