Issue
I'm using Scrapy and I'm having some problems while looping through links.
I'm scraping most of the information from a single page, except for one field that points to another page.
There are 10 articles on each page. For each article I have to get the abstract, which is on a second page. The correspondence between articles and abstracts is 1:1.
Here is the div section I'm using to scrape the data:
<div class="articleEntry">
<div class="tocArticleEntry include-metrics-panel toc-article-tools">
<div class="item-checkbox-container" role="checkbox" aria-checked="false" aria-labelledby="article-d401999e88">
<label tabindex="0" class="checkbox--primary"><input type="checkbox"
name="10.1080/03066150.2021.1956473"><span class="box-btn"></span></label></div><span
class="article-type">Article</span>
<div class="art_title linkable"><a class="ref nowrap" href="/doi/full/10.1080/03066150.2021.1956473"><span
class="hlFld-Title" id="article-d401999e88">Climate change and agrarian struggles: an invitation to
contribute to a <i>JPS</i> Forum</span></a></div>
<div class="tocentryright">
<div class="tocAuthors afterTitle">
<div class="articleEntryAuthor all"><span class="articleEntryAuthorsLinks"><span><a
href="/author/Borras+Jr.%2C+Saturnino+M">Saturnino M. Borras Jr.</a></span>, <span><a
href="/author/Scoones%2C+Ian">Ian Scoones</a></span>, <span><a
href="/author/Baviskar%2C+Amita">Amita Baviskar</a></span>, <span><a
href="/author/Edelman%2C+Marc">Marc Edelman</a></span>, <span><a
href="/author/Peluso%2C+Nancy+Lee">Nancy Lee Peluso</a></span> & <span><a
href="/author/Wolford%2C+Wendy">Wendy Wolford</a></span></span></div>
</div>
<div class="tocPageRange maintextleft">Pages: 1-28</div>
<div class="tocEPubDate"><span class="maintextleft"><strong>Published online:</strong><span class="date"> 06
Aug 2021</span></span></div>
</div>
<div class="sfxLinkButton"></div>
<div class="tocDeliverFormatsLinks"><a href="/doi/abs/10.1080/03066150.2021.1956473">Abstract</a> | <a
class="ref nowrap full" href="/doi/full/10.1080/03066150.2021.1956473">Full Text</a> | <a
class="ref nowrap references" href="/doi/ref/10.1080/03066150.2021.1956473">References</a> | <a
class="ref nowrap nocolwiz" target="_blank" title="Opens new window"
href="/doi/pdf/10.1080/03066150.2021.1956473">PDF (2239 KB)</a> | <a class="ref nowrap epub"
href="/doi/epub/10.1080/03066150.2021.1956473" target="_blank">EPUB</a> | <a
href="/servlet/linkout?type=rightslink&url=startPage%3D1%26pageCount%3D28%26author%3DSaturnino%2BM.%2BBorras%2BJr.%252C%2B%252C%2BIan%2BScoones%252C%2Bet%2Bal%26orderBeanReset%3Dtrue%26imprint%3DRoutledge%26volumeNum%3D49%26issueNum%3D1%26contentID%3D10.1080%252F03066150.2021.1956473%26title%3DClimate%2Bchange%2Band%2Bagrarian%2Bstruggles%253A%2Ban%2Binvitation%2Bto%2Bcontribute%2Bto%2Ba%2BJPS%2BForum%26numPages%3D28%26pa%3D%26oa%3DCC-BY-NC-ND%26issn%3D0306-6150%26publisherName%3Dtandfuk%26publication%3DFJPS%26rpt%3Dn%26endPage%3D28%26publicationDate%3D01%252F02%252F2022"
class="rightslink" target="_blank" title="Opens new window">Permissions</a></div>
<div class="metrics-panel">
<ul class="altmetric-score true">
<li><span>6049</span> Views</li>
<li><span>0</span> CrossRef citations</li>
<li class="value" data-doi="10.1080/03066150.2021.1956473"><span class="metrics-score">0</span>Altmetric
</li>
</ul>
</div><span class="access-icon oa" role="img" aria-label="Access provided by Open Access"></span><span
class="part-tooltip">Open Access</span>
</div>
</div>
To do so, I have defined the following script:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "jps"
    start_urls = ['https://www.tandfonline.com/toc/fjps20/current']

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        Title = response.xpath("//span[@class='hlFld-Title']").extract()
        Authors = response.xpath("//span[@class='articleEntryAuthorsLinks']").extract()
        License = response.xpath("//span[@class='part-tooltip']").extract()
        abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract()
        row_data = zip(Title, Authors, License, abstract_url)
        for quote in row_data:
            scraped_info = {
                # key:value
                'Title': quote[0],
                'Authors': quote[1],
                'License': quote[2],
                'Abstract': quote[3]
            }
            # yield the scraped info to scrapy
            yield scraped_info

    def parse_links(self, response):
        for links in response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract():
            yield scrapy.Request(links, callback=self.parse_abstract_page)
            # yield response.follow(abstract_url, callback=self.parse_abstract_page)

    def parse_abstract_page(self, response):
        Abstract = response.xpath("//div[@class='hlFld-Abstract']").extract_first()
        row_data = zip(Abstract)
        for quote in row_data:
            scraped_info_abstract = {
                # key:value
                'Abstract': quote[0]
            }
            # yield the scraped info to scrapy
            yield scraped_info_abstract
Authors, title, and license are correctly scraped. For the abstract I'm getting the following error:
ValueError: Missing scheme in request url: /doi/abs/10.1080/03066150.2021.1956473
To check whether the path was correct, I removed the abstract_url extraction from the loop:
abstract_url = response.xpath('//*[@class="tocDeliverFormatsLinks"]/a/@href').extract_first()
self.logger.info('get abstract page url')
yield response.follow(abstract_url, callback=self.parse_abstract)
I can correctly reach the abstract corresponding to the first article, but not the others. I think the error is in the loop.
How can I solve this issue?
Thanks
Solution
As @tgrnie explained, the URL is a relative URL, which needs to be converted to an absolute one.
Scrapy has a wrapper around urljoin, which is simply response.urljoin(). No additional imports are required; see the official docs.
So this line:
yield scrapy.Request(links, callback=self.parse_abstract_page)
can be modified like this:
yield scrapy.Request(response.urljoin(links), callback=self.parse_abstract_page)
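Under the hood, response.urljoin delegates to the standard library's urllib.parse.urljoin, using the page's own URL as the base, so you can verify the conversion outside Scrapy:

```python
from urllib.parse import urljoin

# The listing page acts as the base URL; the scraped href is relative.
base = 'https://www.tandfonline.com/toc/fjps20/current'
relative = '/doi/abs/10.1080/03066150.2021.1956473'

absolute = urljoin(base, relative)
print(absolute)
# https://www.tandfonline.com/doi/abs/10.1080/03066150.2021.1956473
```

Because the href starts with a slash, urljoin keeps only the scheme and host from the base and replaces the path.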
The other approach is to use response.follow, as you already did in your code:
yield response.follow(abstract_url, callback=self.parse_abstract)
If you want to follow all links, use yield from together with follow_all, as in the following example:
yield from response.follow_all(list_of_urls, callback=self.parse_abstract)
The biggest difference between yield scrapy.Request(url) and yield response.follow(url) is that a relative URL works with response.follow, while you must provide an absolute URL to create a Request object.
See the documentation here.
Answered By - Upendra