Issue
I'm currently working on a Google Scholar scraper that iterates through several queries over a span of years and writes the first 30 results for each year to a formatted CSV file. However, every time I run the program, there are instances where the next_page variable is None after the response.xpath call, even though the request URLs are identical except for the year.
Below is the body of the Spider:
import scrapy
from urllib.parse import urlencode


class ExampleSpider(scrapy.Spider):
    name = 'worktime'
    allowed_domains = ['api.scraperapi.com']
    years = [2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011,
             2010, 2009, 2008, 2007, 2006, 2005]
    query = ('(Extinct OR Extinction) AND ("Loxodonta africana" OR "african '
             'elephant")')
    start_urls = ['https://scholar.google.com/scholar?']

    def yield_year(self):
        # Pop the next year off the list and build a query restricted to it.
        if self.years:
            year = self.years.pop()
            url = 'https://scholar.google.com/scholar?' + urlencode({
                'hl': 'en', 'q': self.query,
                'as_ylo': str(year), 'as_yhi': str(year)})
            # get_url() wraps the target URL for the proxy API
            # (defined elsewhere in the project).
            return scrapy.Request(get_url(url), self.parse_item_list,
                                  meta={'position': 0})
        else:
            print("All done")

    def parse(self, response):
        print(response.url)
        yield self.yield_year()

    def parse_item_list(self, response):
        position = response.meta['position']
        year_published = response.url[-4:]
        for res in response.xpath('//*[@data-rp]'):
            link = res.xpath('.//h3/a/@href').extract_first()
            temp = res.xpath('.//h3/a//text()').extract()
            if not temp:
                # Citation-only results have no link; mark them with [C].
                title = "[C] " + "".join(
                    res.xpath('.//h3/span[@id]//text()').extract())
            else:
                title = "".join(temp)
            # snippet = "".join(
            #     res.xpath('.//*[@class="gs_rs"]//text()').extract())
            # cited = res.xpath(
            #     './/a[starts-with(text(),"Cited")]/text()').extract_first()
            # temp = res.xpath(
            #     './/a[starts-with(text(),"Related")]/@href').extract_first()
            # related = "https://scholar.google.com" + temp if temp else ""
            # num_versions = res.xpath(
            #     './/a[contains(text(),"version")]/text()').extract_first()
            published_data = "".join(
                res.xpath('.//div[@class="gs_a"]//text()').extract())
            position += 1
            item = {'Title': title, 'Author': published_data,
                    'Year': year_published}
            yield item
        # URL of the next page
        next_page = response.xpath(
            '//td[@align="left"]/a/@href').extract_first()
        if position < 30 and next_page is not None:
            url = "https://scholar.google.com" + next_page
            yield scrapy.Request(get_url(url), self.parse_item_list,
                                 meta={'position': position})
        else:
            yield self.yield_year()
How can I ensure that the scraper reliably extracts a URL for next_page, without hardcoding a link to the next page in the parse_item_list function?
Solution
UPDATE: The issue was resolved. I implemented a try-except block in which the scraper attempts to extract a URL for the next page; if no link is extracted, the program raises a TypeError and yields a request for the current link with dont_filter set to True. I also added a retry_counter so that if no link is found after 3 attempts, it is most likely because there is no next page, and we move on to the next query.
try:
    if self.page_num < 3 and self.retry_counter < 3:
        next_page = response.xpath(
            './/div[@id="gs_nml"]/a[starts-with(text(),'
            + str(self.page_num + 1) + ')]/@href').extract_first()
        if next_page is not None:
            self.page_num += 1
        else:
            raise TypeError
except TypeError:
    print("I got no next page link! Trying again just in case.")
    self.retry_counter += 1
    yield scrapy.Request(response.url, callback=self.parse_item_list,
                         meta={'position': response.meta['position']},
                         dont_filter=True)
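The retry-then-advance control flow above can be sketched independently of Scrapy. The following is a minimal, hypothetical model of the same decision: try to extract a next-page link, retry the same URL up to three times if none is found, and only then conclude there is no next page and move to the next query. The function name `next_action` and the `extract_next_page` callable (which stands in for the XPath lookup) are illustrative, not part of the original spider.

```python
def next_action(extract_next_page, retry_counter, max_retries=3):
    """Decide what the spider should do after parsing one result page.

    Returns one of:
      ('follow', url)              - a next-page link was found; follow it
      ('retry', retry_counter+1)   - no link; re-request same URL (dont_filter=True)
      ('next_query', None)         - retries exhausted; assume no next page exists
    """
    next_page = extract_next_page()
    if next_page is not None:
        return ('follow', next_page)
    if retry_counter < max_retries:
        return ('retry', retry_counter + 1)
    return ('next_query', None)
```

For example, `next_action(lambda: None, 0)` returns `('retry', 1)`, while `next_action(lambda: None, 3)` gives up and returns `('next_query', None)` — mirroring how the spider distinguishes a transient scrape failure from a genuinely absent next page.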
Answered By - Pab1311