Issue
I am currently working on a web scraping project using Scrapy to extract course information from https://www.discoveruni.gov.uk/course-finder/results/. I've encountered a challenge due to the website's dynamic loading behavior, which differs from my previous scraping experiences.
Initially, I successfully retrieved the information from the first page. However, when inspecting the website, I noticed that the XHR responses did not contain the expected JSON data; they were empty.
Here's a snippet of my current Scrapy spider:
import scrapy


class UnispiderSpider(scrapy.Spider):
    name = 'unispider'
    # allowed_domains = ['www.discoveruni.gov.uk']
    start_urls = ['https://www.discoveruni.gov.uk/course-finder/results/']
    base_url = 'https://www.discoveruni.gov.uk/course-finder/results/'

    def parse(self, response):
        # Each result block exposes the course details as data-* attributes.
        course_list = response.xpath(
            '//div[@class="course-finder-results__result-accordion-body-content comparison-course-area mb-4"]')
        for course in course_list:
            courseidentifier = course.xpath('@data-courseidentifier').get()
            uniname = course.xpath('@data-uniname').get()
            uniid = course.xpath('@data-uniid').get()
            coursename = course.xpath('@data-coursename').get()
            link = course.xpath('a/@href').get()
            yield {
                'courseidentifier': courseidentifier,
                'uniname': uniname,
                'uniid': uniid,
                'coursename': coursename,
                'link': link
            }
My main concern is figuring out how to navigate to the next page and continue scraping. Since the XHR responses do not provide the expected JSON data, I'm unsure about the correct approach to handle the pagination.
Any insights or guidance on how to address this issue would be greatly appreciated.
Thank you!!!
I successfully retrieved the information from the first page using Scrapy, but I don't know how to go to the next page.
Solution
While observing network activity in the browser's developer tools, select the All tab instead of XHR to see the POST requests being made for the next page's content. The spider below is one way to reproduce those requests.
import scrapy


class UnispiderSpider(scrapy.Spider):
    name = 'unispider'
    # allowed_domains = ['www.discoveruni.gov.uk']
    start_urls = ['https://www.discoveruni.gov.uk/course-finder/results/']
    base_url = 'https://www.discoveruni.gov.uk/course-finder/results/'

    # Form fields sent with every pagination POST.
    payload = {
        'count': '20',
        'sort_by_subject': 'false',
        'course_query': '',
        'location_radio': 'region',
    }

    def parse(self, response):
        # Stop paginating once a page comes back without any results.
        if not response.css('.comparison-course-area'):
            return

        for course in response.css('.comparison-course-area'):
            yield {
                'courseidentifier': course.xpath('@data-courseidentifier').get(),
                'uniname': course.xpath('@data-uniname').get(),
                'uniid': course.xpath('@data-uniid').get(),
                'coursename': course.xpath('@data-coursename').get(),
                'link': course.xpath('a/@href').get()
            }

        # The first response has no "page" in meta, so it counts as page 1.
        next_page_num = response.meta.get("page", 1) + 1
        # Reuse the CSRF token embedded in the current page's form.
        self.payload['csrfmiddlewaretoken'] = response.css('[name="csrfmiddlewaretoken"]::attr(value)').get()
        self.payload['page'] = str(next_page_num)

        yield scrapy.FormRequest(
            self.base_url,
            method='POST',
            formdata=self.payload,
            callback=self.parse,
            meta={"page": next_page_num}
        )
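The spider above updates the shared class-level payload dictionary on every page. As a minimal sketch of an alternative, the form data could be built as a fresh dictionary per request. The helper name build_page_payload, the constant BASE_PAYLOAD, and the token value 'example-token' are hypothetical and only for illustration; the field names are the ones from the answer. Printing the URL-encoded result makes it easy to compare against the request body shown in the browser's network panel.

```python
from urllib.parse import urlencode

# Form fields copied from the answer's spider; nothing here is site-verified.
BASE_PAYLOAD = {
    'count': '20',
    'sort_by_subject': 'false',
    'course_query': '',
    'location_radio': 'region',
}


def build_page_payload(base_payload, csrf_token, page):
    """Return a new form payload for `page` without mutating `base_payload`."""
    payload = dict(base_payload)  # shallow copy keeps the shared dict untouched
    payload['csrfmiddlewaretoken'] = csrf_token
    payload['page'] = str(page)  # form values are sent as strings
    return payload


# 'example-token' is a placeholder; a real token would come from the page's form.
payload = build_page_payload(BASE_PAYLOAD, 'example-token', 2)
print(urlencode(payload))
```

Inside parse, the returned dictionary could be passed as formdata to scrapy.FormRequest in place of self.payload, assuming the token is read from the response as in the spider above.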
Answered By - SIM