Issue
I'm scraping a Google Scholar profile page. Right now I have Python code using the Beautiful Soup library that collects data from the page:
    import requests
    from bs4 import BeautifulSoup

    url = "https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en"

    # Fetch and parse the profile page; rows beyond the first batch
    # only load after clicking 'show more'
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')

    # Each publication is one table row with class 'gsc_a_tr'
    research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})
    for research in research_article:
        title = research.find('a', {'class': 'gsc_a_at'}).text
        authors = research.find('div', {'class': 'gs_gray'}).text
        print('Title:', title, '\n', '\nAuthors:', authors)
I also have Python code using the Selenium library that opens the profile page and clicks the 'show more' button:
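    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(executable_path="/Applications/chromedriver84")
    driver.get(url)

    # Wait up to 10s until the element is loaded on the page
    element = WebDriverWait(driver, 10).until(
        # Locate element by id
        EC.presence_of_element_located((By.ID, 'gsc_bpf_more'))
    )
    # Click only once the wait has succeeded, so `element` is always defined
    element.click()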
How can I combine these two blocks of code so that I can click the 'show more' button, and scrape the entire page? Thanks in advance!
Solution
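Selenium isn't needed here. The profile URL accepts cstart and pagesize parameters, so you can request the publications in batches of 100 and stop when a batch comes back with fewer than 100 rows. This script prints all titles and authors from the profile: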
    import re
    import requests
    from bs4 import BeautifulSoup

    url = 'https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en'
    api_url = 'https://scholar.google.com/citations?user={user}&hl=en&cstart={start}&pagesize={pagesize}'

    # Pull the user ID out of the profile URL
    user_id = re.search(r'user=(.*?)&', url).group(1)

    start = 0
    while True:
        # Request the next batch of up to 100 rows, starting at offset `start`
        response = requests.post(api_url.format(user=user_id, start=start, pagesize=100))
        soup = BeautifulSoup(response.content, 'html.parser')

        research_article = soup.find_all('tr', {'class': 'gsc_a_tr'})
        for i, research in enumerate(research_article, 1):
            title = research.find('a', {'class': 'gsc_a_at'})
            authors = research.find('div', {'class': 'gs_gray'})
            print('{:04d} {:<80} {}'.format(start + i, title.text, authors.text))

        # A batch shorter than 100 rows means this was the last page
        if len(research_article) != 100:
            break

        start += 100
Prints:
0001 Hyper-heuristics: A Survey of the State of the Art EK Burke, M Hyde, G Kendall, G Ochoa, E Ozcan, R Qu
0002 Hyper-heuristics: An emerging direction in modern search technology E Burke, G Kendall, J Newall, E Hart, P Ross, S Schulenburg
0003 Search methodologies: introductory tutorials in optimization and decision support techniques E Burke, EK Burke, G Kendall
0004 A tabu-search hyperheuristic for timetabling and rostering EK Burke, G Kendall, E Soubeiga
0005 A hyperheuristic approach to scheduling a sales summit P Cowling, G Kendall, E Soubeiga
0006 A classification of hyper-heuristic approaches EK Burker, M Hyde, G Kendall, G Ochoa, E Özcan, JR Woodward
0007 Genetic algorithms K Sastry, D Goldberg, G Kendall
...
0431 Solution Methodologies for generating robust Airline Schedules F Bian, E Burke, S Jain, G Kendall, GM Koole, J Mulder, MCE Paelinck, ...
0432 A Triple objective function with a chebychev dynamic point specification approach to optimise the surface mount placement machine M Ayob, G Kendall
0433 A Library of Vehicle Routing Problems T Pigden, G Kendall, SD Ehsan, E Ozcan, R Eglese
0434 This symposium could not have taken place without the help of a great many people and organisations. We would like to thank the IEEE Computational Intelligence Society for … S Louis, G Kendall
Answered By - Andrej Kesely
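As a side note to the answer, its two reusable pieces (reading the user ID from the profile URL and the "stop when a batch is short" loop) can be pulled into small functions. The following is only a minimal sketch: `extract_user_id`, `paginate`, and `fetch_batch` are hypothetical names that do not appear in the answer, and the data source is a stand-in list so the example runs without network access. It uses `urllib.parse` instead of the regular expression, because the pattern `user=(.*?)&` finds nothing when `user` is the last parameter in the URL.

```python
from urllib.parse import parse_qs, urlparse


def extract_user_id(profile_url):
    """Return the value of the `user` query parameter, wherever it appears."""
    query = parse_qs(urlparse(profile_url).query)
    return query["user"][0]


def paginate(fetch_batch, pagesize=100):
    """Yield (index, row) pairs until a batch comes back shorter than pagesize.

    fetch_batch(start, pagesize) is a hypothetical callable that returns the
    list of rows beginning at offset `start`.
    """
    start = 0
    while True:
        batch = fetch_batch(start, pagesize)
        for i, row in enumerate(batch, 1):
            yield start + i, row
        # A short (or empty) batch means there is nothing left to request
        if len(batch) < pagesize:
            break
        start += pagesize


if __name__ == "__main__":
    # Stand-in data source (an assumption for illustration only)
    fake_rows = ["paper {}".format(n) for n in range(1, 251)]

    def fake_fetch(start, pagesize):
        return fake_rows[start:start + pagesize]

    # `user` is the last parameter here, which the regex version would miss
    url = "https://scholar.google.com/citations?hl=en&user=VjJm3zYAAAAJ"
    print(extract_user_id(url))

    results = list(paginate(fake_fetch))
    print(len(results), results[-1])
```

In a real script, `fetch_batch` would presumably wrap the `requests.post(...)` call and the `soup.find_all('tr', {'class': 'gsc_a_tr'})` lookup from the answer, returning the list of rows for each offset.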