Issue
I'm currently working on a Google Scholar scraper that iterates through several queries over a span of years and writes the first 30 results for each year to a formatted CSV file. However, every time I run the program, there are instances where the next_page variable is None after the response.xpath call, even though the request URLs are identical except for the year.
Below is the body of the Spider:
import scrapy
from urllib.parse import urlencode


class ExampleSpider(scrapy.Spider):
    name = 'worktime'
    allowed_domains = ['api.scraperapi.com']
    years = [2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011,
             2010, 2009, 2008, 2007, 2006, 2005]
    query = ('(Extinct OR Extinction) AND ("Loxodonta africana" OR "african '
             'elephant")')
    start_urls = ['https://scholar.google.com/scholar?']

    def yield_year(self):
        # Pop the next year off the list and build a query restricted to it.
        if self.years:
            year = self.years.pop()
            url = 'https://scholar.google.com/scholar?' + urlencode({
                'hl': 'en', 'q': self.query,
                'as_ylo': str(year), 'as_yhi': str(year)})
            # get_url() wraps the target URL for the proxy API
            # (defined elsewhere in the project).
            return scrapy.Request(get_url(url), self.parse_item_list,
                                  meta={'position': 0})
        else:
            print("All done")

    def parse(self, response):
        print(response.url)
        yield self.yield_year()

    def parse_item_list(self, response):
        position = response.meta['position']
        year_published = response.url[-4:]
        for res in response.xpath('//*[@data-rp]'):
            link = res.xpath('.//h3/a/@href').extract_first()
            temp = res.xpath('.//h3/a//text()').extract()
            if not temp:
                # Citation-only results have no link; mark them with [C].
                title = "[C] " + "".join(
                    res.xpath('.//h3/span[@id]//text()').extract())
            else:
                title = "".join(temp)
            # snippet = "".join(
            #     res.xpath('.//*[@class="gs_rs"]//text()').extract())
            # cited = res.xpath(
            #     './/a[starts-with(text(),"Cited")]/text()').extract_first()
            # temp = res.xpath(
            #     './/a[starts-with(text(),"Related")]/@href').extract_first()
            # related = "https://scholar.google.com" + temp if temp else ""
            # num_versions = res.xpath(
            #     './/a[contains(text(),"version")]/text()').extract_first()
            published_data = "".join(
                res.xpath('.//div[@class="gs_a"]//text()').extract())
            position += 1
            item = {'Title': title, 'Author': published_data,
                    'Year': year_published}
            yield item
        # URL of the next page
        next_page = response.xpath(
            '//td[@align="left"]/a/@href').extract_first()
        if position < 30 and next_page is not None:
            url = "https://scholar.google.com" + next_page
            yield scrapy.Request(get_url(url), self.parse_item_list,
                                 meta={'position': position})
        else:
            yield self.yield_year()
How can I ensure that the scraper reliably extracts a URL for next_page, without hardcoding a link to the next page in the parse_item_list function?
Solution
UPDATE: The issue was resolved. I implemented a try-except block in which the scraper attempts to extract a URL for the next page; if no link is extracted, the program raises a TypeError and yields a request for the current link with dont_filter set to True. I also added a retry_counter so that if no link is found after 3 attempts, it is most likely because there is no next page, and we move on to the next query.
try:
    if self.page_num < 3 and self.retry_counter < 3:
        next_page = response.xpath(
            './/div[@id="gs_nml"]/a[starts-with(text(),'
            + str(self.page_num + 1) + ')]/@href').extract_first()
        if next_page is not None:
            self.page_num += 1
        else:
            raise TypeError
except TypeError:
    print("I got no next page link! Trying again just in case.")
    self.retry_counter += 1
    yield scrapy.Request(response.url, callback=self.parse_item_list,
                         meta={'position': response.meta['position']},
                         dont_filter=True)
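The retry-then-advance control flow above can be sketched independently of Scrapy. The following is a minimal, hypothetical model of the same decision: try to extract a next-page link, retry the same URL up to three times if none is found, and only then conclude there is no next page and move to the next query. The function name `next_action` and the `extract_next_page` callable (which stands in for the XPath lookup) are illustrative, not part of the original spider.

```python
def next_action(extract_next_page, retry_counter, max_retries=3):
    """Decide what the spider should do after parsing one result page.

    Returns one of:
      ('follow', url)              - a next-page link was found; follow it
      ('retry', retry_counter+1)   - no link; re-request same URL (dont_filter=True)
      ('next_query', None)         - retries exhausted; assume no next page exists
    """
    next_page = extract_next_page()
    if next_page is not None:
        return ('follow', next_page)
    if retry_counter < max_retries:
        return ('retry', retry_counter + 1)
    return ('next_query', None)
```

For example, `next_action(lambda: None, 0)` returns `('retry', 1)`, while `next_action(lambda: None, 3)` gives up and returns `('next_query', None)` — mirroring how the spider distinguishes a transient scrape failure from a genuinely absent next page.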
Answered By - Pab1311