Issue
I'm using Scrapy and I'm having some problems while looping through links.
I'm scraping most of the information from a single page, except for one field that points to another page.
There are 10 articles on each page. For each article I have to get the abstract, which is on a second page. The correspondence between articles and abstracts is 1:1.
Here is the div section I'm using to scrape the data:
<div class="articleEntry">
<div class="tocArticleEntry include-metrics-panel toc-article-tools">
<div class="item-checkbox-container" role="checkbox" aria-checked="false" aria-labelledby="article-d401999e88">
<label tabindex="0" class="checkbox--primary"><input type="checkbox"
name="10.1080/03066150.2021.1956473"><span class="box-btn"></span></label></div><span
class="article-type">Article</span>
<div class="art_title linkable"><a class="ref nowrap" href="/doi/full/10.1080/03066150.2021.1956473"><span
class="hlFld-Title" id="article-d401999e88">Climate change and agrarian struggles: an invitation to
contribute to a <i>JPS</i> Forum</span></a></div>
<div class="tocentryright">
<div class="tocAuthors afterTitle">
<div class="articleEntryAuthor all"><span class="articleEntryAuthorsLinks"><span><a
href="/author/Borras+Jr.%2C+Saturnino+M">Saturnino M. Borras Jr.</a></span>, <span><a
href="/author/Scoones%2C+Ian">Ian Scoones</a></span>, <span><a
href="/author/Baviskar%2C+Amita">Amita Baviskar</a></span>, <span><a
href="/author/Edelman%2C+Marc">Marc Edelman</a></span>, <span><a
href="/author/Peluso%2C+Nancy+Lee">Nancy Lee Peluso</a></span> & <span><a
href="/author/Wolford%2C+Wendy">Wendy Wolford</a></span></span></div>
</div>
<div class="tocPageRange maintextleft">Pages: 1-28</div>
<div class="tocEPubDate"><span class="maintextleft"><strong>Published online:</strong><span class="date"> 06
Aug 2021</span></span></div>
</div>
<div class="sfxLinkButton"></div>
<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
class="rightslink" target="_blank" title="Opens new window">Permissions</a></div>
<div class="metrics-panel">
<ul class="altmetric-score true">
<li><span>6049</span> Views</li>
<li><span>0</span> CrossRef citations</li>
<li class="value" data-doi="10.1080/03066150.2021.1956473"><span class="metrics-score">0</span>Altmetric
</li>
</ul>
</div><span class="access-icon oa" role="img" aria-label="Access provided by Open Access"></span><span
class="part-tooltip">Open Access</span>
</div>
</div>
To do so, I have defined the following script:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "jps"
    start_urls = ['https://www.tandfonline.com/toc/fjps20/current']

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        Title = response.xpath("//span[@class='hlFld-Title']").extract()
        Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
        License = response.xpath("//span[@class='part-tooltip']").extract()
        abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
        row_data = zip(Title, Authors, License, abstract_url)
        for quote in row_data:
            scraped_info = {
                # key:value
                'Title': quote[0],
                'Authors': quote[1],
                'License': quote[2],
                'Abstract': quote[3]
            }
            # yield the scraped info to scrapy
            yield scraped_info

    def parse_links(self, response):
        for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
            yield scrapy.Request(links, callback=self.parse_abstract_page)
            # yield response.follow(abstract_url, callback=self.parse_abstract_page)

    def parse_abstract_page(self, response):
        Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
        row_data = zip(Abstract)
        for quote in row_data:
            scraped_info_abstract = {
                # key:value
                'Abstract': quote[0]
            }
            # yield the scraped info to scrapy
            yield scraped_info_abstract
Authors, title, and license are correctly scraped. For the abstract I'm getting the following error:
ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473
To check whether the path was correct, I removed the abstract_url extraction from the loop:
abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
self.logger.info('get abstract page url')
yield response.follow(abstract_url, callback=self.parse_abstract)
I can correctly reach the abstract corresponding to the first article, but not the others. I think the error is in the loop.
How can I solve this issue?
Thanks
Solution
As @tgrnie explained, the URL is a relative URL, which needs to be converted to an absolute one.
Scrapy has a wrapper around urljoin, which is simply response.urljoin(). No additional imports are required; see the official docs.
So this line:
yield scrapy.Request(links, callback=self.parse_abstract_page)
can be modified like this:
yield scrapy.Request(response.urljoin(links), callback=self.parse_abstract_page)
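Under the hood, response.urljoin delegates to the standard library's urllib.parse.urljoin, using the page's own URL as the base, so you can verify the conversion outside Scrapy:

```python
from urllib.parse import urljoin

# The listing page acts as the base URL; the scraped href is relative.
base = 'https://www.tandfonline.com/toc/fjps20/current'
relative = '/doi/abs/10.1080/03066150.2021.1956473'

absolute = urljoin(base, relative)
print(absolute)
# https://www.tandfonline.com/doi/abs/10.1080/03066150.2021.1956473
```

Because the href starts with a slash, urljoin keeps only the scheme and host from the base and replaces the path.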
The other approach is to use response.follow, as you already did in your code:
yield response.follow(abstract_url, callback=self.parse_abstract)
If you want to follow all links, use yield from together with follow_all, as in the following example:
yield from response.follow_all(list_of_urls, callback=self.parse_abstract)
The biggest difference between yield scrapy.Request(url) and yield response.follow(url) is that a relative URL works with response.follow, while you must provide an absolute URL to create a Request object.
See the documentation here.
Answered By - Upendra