Issue
On this page (https://www.realestate.com.kh/buy/), I managed to grab a list of ads, and individually parse their content with this code:
import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        c = 0
        for url in ads:
            c += 1
            absolute_url = response.urljoin(url.extract())
            self.item = {}
            self.item['url'] = absolute_url
            yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})

    def parse_ad(self, response):
        # Extract things
        yield {
            # Yield things
        }
However, I'd like to automate the switching from one page to the next so I can grab all of the available ads (not only the ones on the first page, but on every page), presumably by simulating clicks on the 1, 2, 3, 4, ..., 50 pagination buttons shown at the bottom of the listing page.
Is this even possible with Scrapy? If so, how can one achieve this?
Solution
Yes, it's possible. Let me show you two ways of doing it.

The first way: have your spider select the next-page button, get its @href value, build a full URL from it, and yield that as a new request.
Here is an example:
def parse(self, response):
    ....
    href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
    req_url = response.urljoin(href)
    yield Request(url=req_url, callback=self.parse_ad)
- The selector in this example always returns the @href of the next page's button: it returns only one value, the href of the button for the page after the one you are currently on.
- On this page the href is a relative URL, so we need to use the response.urljoin() method to build a full URL; it uses the response's URL as the base.
- We yield a new request, and the response will be parsed in the callback function you specify.
- This requires the callback function to always yield the request for the following page as well, so it's a recursive solution (see the sketch after this list).
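For concreteness, here is a minimal sketch of how this recursive approach could be wired into the spider from the question. The selectors are copied from the question and from the snippet above; routing the next-page request back through parse (so every listing page is handled the same way) is my adaptation, not code from the original answer:

import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        # Yield one request per ad found on the current listing page.
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        for url in ads:
            absolute_url = response.urljoin(url.extract())
            yield scrapy.Request(absolute_url, callback=self.parse_ad,
                                 meta={'item': {'url': absolute_url}})

        # Follow the next-page button and run this same method on the new page,
        # so pagination continues until there is no next button (recursive).
        href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]'
                              '/following-sibling::a[1]/@href').get()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

    def parse_ad(self, response):
        item = response.meta['item']
        # Extract the ad's fields here and add them to item before yielding it.
        yield item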
A simpler approach is to just observe the pattern of the hrefs and yield all the requests manually. Each button has an href of "/buy/?page={nr}", where {nr} is the page number, so we can change this nr value ourselves and yield all the requests at once.
def parse(self, response):
    ....
    nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
    last_page_nr = int(nr_pages[-1])
    for nr in range(2, last_page_nr + 1):
        req_url = f'/buy/?page={nr}'
        yield Request(url=response.urljoin(req_url), callback=self.parse_ad)
- nr_pages returns the numbers shown on all the page buttons.
- last_page_nr selects the last of those numbers, which is the last available page.
- We loop over the range from 2 to last_page_nr (50 in this case) and in each iteration request the page that corresponds to that number.
- This way you make all the requests in your parse method and parse each response in parse_ad, without recursive calling (a full sketch follows this list).
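Adapted to the spider from the question, the non-recursive variant could look roughly like this. Note that parse_listing is a helper name I'm introducing here so that extracting ads from a listing page stays separate from the one-off pagination loop; the snippet in the answer sends the page requests to parse_ad instead:

import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        # Runs once, on page 1: read the last page number and queue pages 2..last.
        nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
        last_page_nr = int(nr_pages[-1])
        for nr in range(2, last_page_nr + 1):
            yield scrapy.Request(response.urljoin(f'/buy/?page={nr}'),
                                 callback=self.parse_listing)

        # Page 1 is itself a listing page, so extract its ads as well.
        yield from self.parse_listing(response)

    def parse_listing(self, response):
        # Hypothetical helper: yields one request per ad on a listing page.
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        for url in ads:
            absolute_url = response.urljoin(url.extract())
            yield scrapy.Request(absolute_url, callback=self.parse_ad,
                                 meta={'item': {'url': absolute_url}})

    def parse_ad(self, response):
        item = response.meta['item']
        # Extract the ad's fields here and add them to item before yielding it.
        yield item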
Finally, I suggest you take a look at the Scrapy tutorial; it covers several common scraping scenarios.
Answered By - renatodvc