Issue
I am trying to scrape multiple pages, but my crawler ends up cycling between pages 1 and 2. How can I write a script that only moves forward? I tried the following selector, but couldn't move from page 1 to page 2.
NEXT_PAGE_SELECTOR = '//span[@class="page-link"]//span[contains(text(),"»")]/preceding-sibling::a/@href'
nextPageUrl = response.urljoin(response.xpath(NEXT_PAGE_SELECTOR).extract_first())
On page 1:
<span class="page-link"><a href=".../page/2/"><span aria-hidden="true">»</span><span class="sr-only">Next page</span></a></span>
On page 2:
<span class="page-link"><a href=".../page/1/"><span aria-hidden="true">«</span><span class="sr-only">Previous page</span></a></span>
Thanks
Solution
It's hard to debug what happens with NEXT_PAGE_SELECTOR alone. There is a simpler way to walk through all the pages you need: use the spider's parse method. Inside parse you extract the data from the current page, then take the next-page URL and yield a new request with callback=self.parse. Scrapy will fetch that URL and run parse again on the next page's response.
from scrapy.spiders import CrawlSpider


class SomeSpider(CrawlSpider):
    name = 'SPIDER NAME'
    allowed_domains = ['ALLOWED DOMAINS HERE']
    start_urls = ['START_URL']

    def parse(self, response):
        # First, collect all item links from the current page.
        urls = response.css('div.title a::attr(href)').extract()
        for url in urls:
            yield response.follow(url, callback=self.parse_data_page)

        # Second, take the next-page URL and yield it with this same
        # method as the callback. Match only the <a> that wraps the »
        # arrow, so the « (previous-page) link is never followed.
        next_page = response.xpath(
            '//span[@class="page-link"]/a[span[contains(text(),"»")]]/@href'
        ).extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_data_page(self, response):
        # Parse the item data page here.
        ...
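As an aside, the original NEXT_PAGE_SELECTOR never matches because the » arrow is a child of the <a>, not a preceding sibling of it, so preceding-sibling::a finds nothing. A minimal sketch (assuming lxml is available; the HTML snippets are copied from the question) comparing the broken selector with a forward-only one:

```python
# Sketch: why the original XPath fails, checked with lxml.
from lxml import etree

# Pagination snippets copied from the question (well-formed, so
# they can be parsed as XML fragments here).
PAGE_1 = ('<span class="page-link"><a href="/page/2/">'
          '<span aria-hidden="true">»</span>'
          '<span class="sr-only">Next page</span></a></span>')
PAGE_2 = ('<span class="page-link"><a href="/page/1/">'
          '<span aria-hidden="true">«</span>'
          '<span class="sr-only">Previous page</span></a></span>')

# Original selector: looks for an <a> preceding the » span, but the
# » span is a *child* of the <a>, so this never matches anything.
BROKEN = ('//span[@class="page-link"]'
          '//span[contains(text(),"»")]/preceding-sibling::a/@href')

# Fixed selector: the <a> whose child span contains ».
FIXED = '//span[@class="page-link"]/a[span[contains(text(),"»")]]/@href'

page1 = etree.fromstring(PAGE_1)
page2 = etree.fromstring(PAGE_2)

print(page1.xpath(BROKEN))  # [] — the bug: no next page found
print(page1.xpath(FIXED))   # ['/page/2/']
print(page2.xpath(FIXED))   # [] — no » on the backward link, so no cycling
```

Because the fixed selector matches nothing on a page whose only arrow is «, the crawler can only move forward and stops at the last page.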
Answered By - Oleg T.