Issue
I am trying to scrape some data from Google Scholar with Scrapy; my code is the following:
import scrapy

class TryscraperSpider(scrapy.Spider):
    name = 'tryscraper'
    start_urls = ['https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ&pagesize=100&view_op=list_works&sortby=pubdate']

    def parse(self, response):
        # Follow the detail page of every paper listed on the profile.
        for link in response.css('a.gsc_a_at::attr(href)'):
            yield response.follow(link.get(), callback=self.parse_scholar)

    def parse_scholar(self, response):
        try:
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': response.css('div.gsh_csp::text').get()
            }
        except (AttributeError, IndexError):
            # Fall back to 'NA' if the detail fields fail to parse.
            yield {
                'authors': response.css('div.gsc_oci_value::text').get().strip(),
                'journal': response.css('div.gsc_oci_value::text').extract()[2].strip(),
                'date': response.css('div.gsc_oci_value::text').extract()[1].strip(),
                'abstract': 'NA'
            }
This code works well, but it only gives me the first 100 papers from the author. I would like to scrape them all, but the spider would also need to press the "Show more" button. I have seen in related posts that Scrapy has no built-in way to do so, but that you can incorporate functionality from Selenium to do the job. Unfortunately, I am a bit of a novice and therefore completely lost; any suggestions? Thanks in advance.
Here is the Selenium approach that should do the job, but I would like to combine it with my Scrapy spider, which works well and is very fast.
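For context, a minimal Selenium sketch of that button-clicking approach might look like the following; the button id gsc_bpf_more is an assumption based on the current page markup:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://scholar.google.com/citations?hl=en&user=JUn8PgwAAAAJ&pagesize=100&view_op=list_works&sortby=pubdate')

# Keep clicking "Show more" until Scholar disables the button.
while True:
    button = driver.find_element(By.ID, 'gsc_bpf_more')  # assumed button id
    if not button.is_enabled():
        break
    button.click()
    time.sleep(2)  # give the extra rows time to load

# Collect the paper links, same selector as in the spider above.
links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, 'a.gsc_a_at')]
driver.quit()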
Solution
Check out the following implementation. It should give you all the results from that page by replaying the AJAX request that the "Show more" button fires: it bumps the cstart offset by 100 on each request until no more rows come back, so no browser automation is needed.
import urllib.parse

import scrapy
from scrapy import Selector

class ScholarSpider(scrapy.Spider):
    name = 'scholar'
    start_url = 'https://scholar.google.com/citations?'
    params = {
        'hl': 'en',
        'user': 'JUn8PgwAAAAJ',
        'view_op': 'list_works',
        'sortby': 'pubdate',
        'cstart': 0,
        'pagesize': '100'
    }

    def start_requests(self):
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        # POSTing json=1 makes Scholar return the same payload the
        # "Show more" button fetches via AJAX.
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse(self, response):
        # The 'B' key holds the HTML for the current batch of rows;
        # it comes back empty once the list is exhausted.
        if not response.json()['B']:
            return
        resp = Selector(text=response.json()['B'])
        for item in resp.css("tr > td > a[href^='/citations']::attr(href)").getall():
            inner_link = f"https://scholar.google.com{item}"
            yield scrapy.Request(inner_link, callback=self.parse_content)
        # Advance the offset and request the next batch of 100.
        self.params['cstart'] += 100
        req_url = f"{self.start_url}{urllib.parse.urlencode(self.params)}"
        yield scrapy.FormRequest(req_url, formdata={'json': '1'}, callback=self.parse)

    def parse_content(self, response):
        yield {
            'authors': response.css(".gsc_oci_field:contains('Author') + .gsc_oci_value::text").get(),
            'journal': response.css(".gsc_oci_field:contains('Journal') + .gsc_oci_value::text").get(),
            'date': response.css(".gsc_oci_field:contains('Publication date') + .gsc_oci_value::text").get(),
            'abstract': response.css("#gsc_oci_descr .gsh_csp::text").get()
        }
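If you would rather run it as a standalone script than through the scrapy CLI, a minimal runner (assuming the spider class above lives in the same file) could look like this:

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        # Dump the scraped items to a JSON file.
        'FEEDS': {'papers.json': {'format': 'json'}},
        # Throttle requests a little; Scholar blocks aggressive crawlers.
        'DOWNLOAD_DELAY': 2,
    })
    process.crawl(ScholarSpider)
    process.start()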
Answered By - SIM