Issue
I am trying to scrape the Ranking Criteria section of the following page using requests.get():
"https://www.topuniversities.com/universities/massachusetts-institute-technology-mit"
The response contains the entire HTML of the page except the Ranking Criteria (10 parameters). I cannot figure out why only that one section is missing.
import requests
from bs4 import BeautifulSoup
url = 'https://www.topuniversities.com/universities/massachusetts-institute-technology-mit'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
My goal is to capture each parameter together with its exact score. For example:
parameters_list = ['overall','academic reputation',...]
value_list = ['100','100',...]
Solution
The data you see on the page is loaded from an external URL via JavaScript, so it is not part of the HTML that requests.get() returns for the page itself. To simulate that request, you can do:
import requests
from bs4 import BeautifulSoup
url = "https://www.topuniversities.com/qs-profiles/rank-data/513/410/0?_wrapper_format=drupal_ajax"
payload = {
"js": "true",
"_drupal_ajax": "1",
"ajax_page_state[theme]": "tu_d8",
"ajax_page_state[theme_token]": "",
"ajax_page_state[libraries]": "addtoany/addtoany.front,ckeditor_accordion/accordion.frontend,clientside_validation_jquery/cv.jquery.ckeditor,clientside_validation_jquery/cv.jquery.ife,clientside_validation_jquery/cv.jquery.validate,clientside_validation_jquery/cv.pattern.method,core/drupal.form,core/drupal.states,core/normalize,eu_cookie_compliance/eu_cookie_compliance_bare,flag/flag.link_ajax,ga/analytics,layout_discovery/onecol,qs_article/qs_article,qs_firebase_sso/sso-lib,qs_firebase_sso/sso-lib-header,qs_flexreg_user_flow/qs_flexreg_user_flow,qs_global_site_search/search_header,qs_profiles/highcharts,qs_profiles/qs_profiles,qs_profiles/qs_profiles_circle,qs_user_profile/qsUserProfile,system/base,tu_d8/global,tu_d8/node,tu_d8/profile_header,tu_d8/qna_forums,tu_d8/qs_campus_locations,tu_d8/qs_instant,tu_d8/qs_profile_new,tu_d8/qs_profile_new_datalayer,tu_d8/qs_program_tabs,tu_d8/qs_ranking_chart,tu_d8/qs_related_content,tu_d8/qs_similar_programs,views/views.module",
}
headers = {"X-Requested-With": "XMLHttpRequest"}
data = requests.post(url, data=payload, headers=headers).json()
soup = BeautifulSoup(data[-1]["data"], "html.parser")
for score in soup.select(".score"):
    name = score.find_next("div", class_="itm-name").get_text(strip=True)
    print(f"{name:<50} {score.get_text(strip=True)}")
Prints:
Overall 100
Academic Reputation 100
Employer Reputation 100
Faculty Student Ratio 100
Citations per Faculty 100
International Faculty Ratio 100
International Students Ratio 88.2
International Research Network 94.3
Employment Outcomes 100
Sustainability 95.2
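To get the two lists asked for in the question instead of printed lines, the loop above could collect (name, score) pairs and split them afterwards. The sketch below is only an illustration: pick_html and split_pairs are hypothetical helper names, the sample data is made up to match the shape of the output above, and it assumes the AJAX response is a JSON list of command dictionaries in which one command carries the HTML under "data" (the code above relies on it being the last one).

```python
def pick_html(commands):
    """Return the HTML from the last command that has a non-empty string under 'data'.

    Assumption: the response is a list of dicts; this is a slightly more
    defensive alternative to indexing data[-1]["data"] directly.
    """
    for command in reversed(commands):
        html = command.get("data") if isinstance(command, dict) else None
        if isinstance(html, str) and html.strip():
            return html
    raise ValueError("no command with HTML data found")


def split_pairs(pairs):
    """Split (name, score) pairs into lower-cased names and score strings."""
    parameters_list = [name.strip().lower() for name, _ in pairs]
    value_list = [score.strip() for _, score in pairs]
    return parameters_list, value_list


# Hypothetical sample data, shaped like the response and the printed output above.
sample_commands = [
    {"command": "settings", "settings": {}},
    {"command": "insert", "data": "<div class='score'>100</div>"},
]
sample_pairs = [
    ("Overall", "100"),
    ("Academic Reputation", "100"),
    ("International Students Ratio", "88.2"),
]

html = pick_html(sample_commands)
parameters_list, value_list = split_pairs(sample_pairs)
print(parameters_list)
print(value_list)
```

In the real script, the pairs would come from the loop itself, for example by appending (name, score.get_text(strip=True)) to a list instead of printing, and pick_html(data) would replace data[-1]["data"].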
Answered By - Andrej Kesely