Monday, January 29, 2024

[FIXED] Python Scrapy to scrape a dynamically loaded website

Labels: dynamic, python, scrapy, web-scraping, xmlhttprequest

Issue

I am currently working on a web scraping project using Scrapy to extract course information from https://www.discoveruni.gov.uk/course-finder/results/. I've encountered a challenge due to the website's dynamic loading behavior, which differs from my previous scraping experiences.

Initially, I successfully retrieved the information from the first page. However, when inspecting the website, I noticed that the XHR responses did not contain the expected JSON data; they were empty.

Here's a snippet of my current Scrapy spider:

import scrapy


class UnispiderSpider(scrapy.Spider):
    name = 'unispider'
    # allowed_domains = ['www.discoveruni.gov.uk']
    start_urls = ['https://www.discoveruni.gov.uk/course-finder/results/']
    base_url = 'https://www.discoveruni.gov.uk/course-finder/results/'

    def parse(self, response):
        # Each result block carries the course details as data-* attributes.
        course_list = response.xpath(
            '//div[@class="course-finder-results__result-accordion-body-content comparison-course-area mb-4"]')
        for course in course_list:
            courseidentifier = course.xpath('@data-courseidentifier').get()
            uniname = course.xpath('@data-uniname').get()
            uniid = course.xpath('@data-uniid').get()
            coursename = course.xpath('@data-coursename').get()
            link = course.xpath('a/@href').get()
            yield {
                'courseidentifier': courseidentifier,
                'uniname': uniname,
                'uniid': uniid,
                'coursename': coursename,
                'link': link
            }

My main concern is figuring out how to navigate to the next page and continue scraping. Since the XHR responses do not provide the expected JSON data, I'm unsure about the correct approach to handle the pagination.

Any insights or guidance on how to address this issue would be greatly appreciated.

Thank you!!!

I successfully retrieved the information from the first page using Scrapy, but I do not know how to go to the next page.

Solution

While observing network activity in the browser's developer tools, select the All tab instead of XHR. There you can see the POST request that is made for each subsequent page's content. Replicating that request in the spider is one way to handle the pagination:

import scrapy


class UnispiderSpider(scrapy.Spider):
    name = 'unispider'
    # allowed_domains = ['www.discoveruni.gov.uk']
    start_urls = ['https://www.discoveruni.gov.uk/course-finder/results/']
    base_url = 'https://www.discoveruni.gov.uk/course-finder/results/'

    # Static form fields sent with every POST request for a results page.
    payload = {
        'count': '20',
        'sort_by_subject': 'false',
        'course_query': '',
        'location_radio': 'region',
    }

    def parse(self, response):
        courses = response.css('.comparison-course-area')
        # Stop paginating once a page contains no courses.
        if not courses:
            return

        for course in courses:
            yield {
                'courseidentifier': course.xpath('@data-courseidentifier').get(),
                'uniname': course.xpath('@data-uniname').get(),
                'uniid': course.xpath('@data-uniid').get(),
                'coursename': course.xpath('@data-coursename').get(),
                'link': course.xpath('a/@href').get()
            }

        # The first response has no "page" in meta, so it counts as page 1.
        next_page_num = response.meta.get("page", 1) + 1
        # Build a fresh dict per request; the CSRF token comes from the current page.
        formdata = {
            **self.payload,
            'csrfmiddlewaretoken': response.css('[name="csrfmiddlewaretoken"]::attr(value)').get(),
            'page': str(next_page_num),
        }

        yield scrapy.FormRequest(
            self.base_url,
            method='POST',
            formdata=formdata,
            callback=self.parse,
            meta={"page": next_page_num}
        )
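
One edge case worth guarding against: if a response does not contain the csrfmiddlewaretoken input, the selector's .get() returns None, and a None value in formdata would most likely make FormRequest fail while encoding the body. The sketch below pulls the payload construction into a small standalone function so that case can be handled explicitly. The helper name build_next_page_formdata, the constant BASE_PAYLOAD, and the token value 'dummy-token' are made up for illustration and are not part of Scrapy or the original answer.

```python
from urllib.parse import urlencode

# Static form fields, copied from the spider's payload in the answer.
BASE_PAYLOAD = {
    'count': '20',
    'sort_by_subject': 'false',
    'course_query': '',
    'location_radio': 'region',
}


def build_next_page_formdata(csrf_token, current_page, base_payload=BASE_PAYLOAD):
    """Return the form fields for the page after current_page, or None.

    None signals that no CSRF token was found, so the caller can stop
    paginating instead of sending an incomplete form.
    """
    if not csrf_token:
        return None
    # Copy the shared dict so the static fields are never modified.
    formdata = dict(base_payload)
    formdata['csrfmiddlewaretoken'] = csrf_token
    formdata['page'] = str(current_page + 1)
    return formdata


# Hypothetical token; in the spider it would come from the response.
formdata = build_next_page_formdata('dummy-token', 1)
# Print the form-encoded body, comparable to what the network tab shows.
print(urlencode(formdata))
```

Inside parse, the spider could call this helper with the extracted token and response.meta.get("page", 1), return early when it gets None, and otherwise pass the result as formdata to scrapy.FormRequest.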

Answered By - SIM

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
