Issue
On this page (https://www.realestate.com.kh/buy/), I managed to grab a list of ads, and individually parse their content with this code:
import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    # allowed_domains = ['https://www.realestate.com.kh/buy/']
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        c = 0
        for url in ads:
            c += 1
            absolute_url = response.urljoin(url.extract())
            self.item = {}
            self.item['url'] = absolute_url
            yield scrapy.Request(absolute_url, callback=self.parse_ad, meta={'item': self.item})

    def parse_ad(self, response):
        # Extract things
        yield {
            # Yield things
        }
However, I'd like to automate the switching from one page to the next so I can grab all of the available ads (not only the ones on the first page, but on every page), presumably by simulating clicks on the 1, 2, 3, 4, ..., 50 pagination buttons shown at the bottom of the listing page.
Is this even possible with Scrapy? If so, how can one achieve this?
Solution
Yes, it's possible. Let me show you two ways of doing it.

The first way: have your spider select the next-page button, get its @href value, build a full URL from it, and yield that as a new request.
Here is an example:
def parse(self, response):
    ....
    href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]/following-sibling::a[1]/@href').get()
    req_url = response.urljoin(href)
    yield Request(url=req_url, callback=self.parse_ad)
- The selector in this example always returns the @href of the next page's button: it returns only one value, the href of the button for the page after the one you are currently on.
- On this page the href is a relative URL, so we need to use the response.urljoin() method to build a full URL; it uses the response's URL as the base.
- We yield a new request, and the response will be parsed in the callback function you specify.
- This requires the callback function to always yield the request for the following page as well, so it's a recursive solution (see the sketch after this list).
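For concreteness, here is a minimal sketch of how this recursive approach could be wired into the spider from the question. The selectors are copied from the question and from the snippet above; routing the next-page request back through parse (so every listing page is handled the same way) is my adaptation, not code from the original answer:

import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        # Yield one request per ad found on the current listing page.
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        for url in ads:
            absolute_url = response.urljoin(url.extract())
            yield scrapy.Request(absolute_url, callback=self.parse_ad,
                                 meta={'item': {'url': absolute_url}})

        # Follow the next-page button and run this same method on the new page,
        # so pagination continues until there is no next button (recursive).
        href = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-owq2hj"]'
                              '/following-sibling::a[1]/@href').get()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse)

    def parse_ad(self, response):
        item = response.meta['item']
        # Extract the ad's fields here and add them to item before yielding it.
        yield item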
A simpler approach is to just observe the pattern of the hrefs and yield all the requests manually. Each button has an href of "/buy/?page={nr}", where {nr} is the page number, so we can change this nr value ourselves and yield all the requests at once.
def parse(self, response):
    ....
    nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
    last_page_nr = int(nr_pages[-1])
    for nr in range(2, last_page_nr + 1):
        req_url = f'/buy/?page={nr}'
        yield Request(url=response.urljoin(req_url), callback=self.parse_ad)
- nr_pages returns the numbers shown on all the page buttons.
- last_page_nr selects the last of those numbers, which is the last available page.
- We loop over the range from 2 to last_page_nr (50 in this case) and in each iteration request the page that corresponds to that number.
- This way you make all the requests in your parse method and parse each response in parse_ad, without recursive calling (a full sketch follows this list).
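Adapted to the spider from the question, the non-recursive variant could look roughly like this. Note that parse_listing is a helper name I'm introducing here so that extracting ads from a listing page stays separate from the one-off pagination loop; the snippet in the answer sends the page requests to parse_ad instead:

import scrapy


class scrapingThings(scrapy.Spider):
    name = 'scrapingThings'
    start_urls = ['https://www.realestate.com.kh/buy/']

    def parse(self, response):
        # Runs once, on page 1: read the last page number and queue pages 2..last.
        nr_pages = response.xpath('//div[@class="desktop-buttons"]/a[@class="css-1en2dru"]//text()').getall()
        last_page_nr = int(nr_pages[-1])
        for nr in range(2, last_page_nr + 1):
            yield scrapy.Request(response.urljoin(f'/buy/?page={nr}'),
                                 callback=self.parse_listing)

        # Page 1 is itself a listing page, so extract its ads as well.
        yield from self.parse_listing(response)

    def parse_listing(self, response):
        # Hypothetical helper: yields one request per ad on a listing page.
        ads = response.xpath('//*[@class="featured css-ineky e1jqslr40"]//a/@href')
        for url in ads:
            absolute_url = response.urljoin(url.extract())
            yield scrapy.Request(absolute_url, callback=self.parse_ad,
                                 meta={'item': {'url': absolute_url}})

    def parse_ad(self, response):
        item = response.meta['item']
        # Extract the ad's fields here and add them to item before yielding it.
        yield item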
Finally, I suggest you take a look at the Scrapy tutorial; it covers several common scraping scenarios.
Answered By - renatodvc