Wednesday, January 12, 2022

[FIXED] Python Scrapy - CSS Selector - Why use .get() instead of .extract()

Labels: css-selectors, python, scrapy

Issue

First of all, thanks for all the support this community has given to an old man who is a newbie at Python. My thanks.

I am taking a course, and I am trying to understand every word of the code. If something is not clear to me, I search for an answer.

I read the Scrapy documentation and do not understand why .get() is used instead of the other options.

I am writing a Scrapy spider. Right now I am working on getting the next page.

The problem: why .get()

next_page = response.css('li.next a::attr(href)').get()

I Was Expecting

next_page = response.css('li.next a::attr(href)')

or...

next_page = response.css('li.next a::attr(href)').extract()

Here is the HTML Code

The HTML code is here just to clarify the information. You can also see it at quotes.toscrape.com

<li class="next">
    <a href="/page/2/">
    "Next "
        <span aria-hidden="true">→</span>
    </a>
</li>

Here is my full Spider code

I think the community does not need this, but I want to give as much info as I can. Thanks.

import scrapy
from ..items import QuotetutorialItem

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:

            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items

        next_page = response.css('li.next a::attr(href)').get()

        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Thanks a lot.

Sorry for another dumb question.

I can delete this post if that is better for Stack Overflow.

Solution

From the docs

If you’re a long-time Scrapy user, you’re probably familiar with .extract() and .extract_first() selector methods. Many blog posts and tutorials are using them as well. These methods are still supported by Scrapy, there are no plans to deprecate them. However, Scrapy usage docs are now written using .get() and .getall() methods. We feel that these new methods result in a more concise and readable code.

If you don't call get(), getall(), extract() or extract_first(), then you only have a query.
For example:

response.css('li.next a::attr(href)') describes what you want to get. It's a query. Think of it as "please find these elements in the HTML". Scrapy finds them and returns a SelectorList of matches, but you still need to extract the values if you want to assign them to a variable. So you use get() if you want only the first result (a string, or None if nothing matched), and getall() if you want all the results as a list. You can also use the older names extract() and extract_first(): on the result of response.css(), extract() behaves like getall() and extract_first() behaves like get().
So the final result is:

next_page = response.css('li.next a::attr(href)').get()

which gets you the url for the next page.
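To make the difference in return types concrete, here is a minimal sketch that uses only the standard library, so it runs without Scrapy installed. ToySelectorList is a hypothetical stand-in, not Scrapy's own class: it stores plain strings, whereas the real object returned by response.css() holds selector objects. The values in it are taken from the HTML in the question.

```python
class ToySelectorList(list):
    """Hypothetical, simplified stand-in for what response.css() returns."""

    def getall(self):
        # Every match, always as a list (empty when nothing matched).
        return list(self)

    def get(self, default=None):
        # Only the first match, or `default` when nothing matched.
        return self[0] if self else default

    # The older method names behave the same way on a list of matches.
    extract = getall
    extract_first = get


# Pretend the query 'li.next a::attr(href)' matched one link on page 1 ...
matches = ToySelectorList(["/page/2/"])
# ... and matched nothing on the last page, which has no "Next" link.
no_matches = ToySelectorList([])

next_page = matches.get()       # a single string
all_pages = matches.extract()   # a list containing that string
last_page = no_matches.get()    # None, so the spider can stop paginating

print(next_page, all_pages, last_page)
```

This is why the spider uses .get() for the next-page link: it yields a single string (or None) that fits the `if next_page is not None` check and can be passed to response.follow(). With .extract(), next_page would be a list, and even an empty list is not None, so the check would always pass.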

Answered By - whichperson

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
