Issue
First of all, thanks for all the support this community has given to an old man who is a newbie at Python. My thanks.
I am doing a course, and I am trying to understand each word of the code. If something is not clear to me, I search for an answer.
I read the Scrapy documentation and do not understand why .get() is used instead of the other options.
I am writing a Scrapy spider. Right now I am working on getting the next page.
The problem: why .get()?
next_page = response.css('li.next a::attr(href)').get()
I was expecting
next_page = response.css('li.next a::attr(href)')
or...
next_page = response.css('li.next a::attr(href)').extract()
Here is the HTML Code
The HTML is included just to make the question clearer. You can see the page at quotes.toscrape.com
<li class="next">
    <a href="/page/2/">
        "Next "
        <span aria-hidden="true">→</span>
    </a>
</li>
Here is my full Spider code
I don't think the community needs this, but I want to give as much information as I can. Thanks.
import scrapy
from ..items import QuotetutorialItem


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
Thanks a lot.
Sorry for another dumb question.
I can delete this post if that is better for Stack Overflow.
Solution
From the docs:
If you’re a long-time Scrapy user, you’re probably familiar with .extract() and .extract_first() selector methods. Many blog posts and tutorials are using them as well. These methods are still supported by Scrapy, there are no plans to deprecate them. However, Scrapy usage docs are now written using .get() and .getall() methods. We feel that these new methods result in a more concise and readable code.
If you don't use get(), getall(), extract() or extract_first(), then you only have a query.
For example:
response.css('li.next a::attr(href)')
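tells Scrapy what you want to get. It's a query. Think of it as "please find these elements in the HTML". Scrapy will find them, but what you get back is a list of selectors, not the values themselves, so you need to extract them if you want to assign the actual strings to a variable. So you use get() if you want only the first result (a string, or None if nothing matched), or getall() if you want all the results as a list. On the result of response.css(), the older extract_first() and extract() do the same as get() and getall(), respectively.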
So the final result is:
next_page = response.css('li.next a::attr(href)').get()
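which gets you the URL of the next page as a string (here the relative path /page/2/), or None when there is no "next" link. That is why the spider checks next_page is not None before following it.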
Answered By - whichperson