Monday, March 14, 2022

[FIXED] How to use css selector in object from HtmlResponse

Labels: python-3.x, scrapy

Issue

I'm currently developing an application using Scrapy.

I want to extract some values with a CSS selector outside of def parse, so I first created an HtmlResponse object and then called css() on it, but I get nothing back.

Within def parse, I can get the value in the same way.

What should I do if it is outside of def parse?

Here is the code:

import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    my_response = HtmlResponse(url=start_urls[0])

    print('HtmlResponse')
    print(my_response)

    h3s = my_response.css('h3')

    print(str(len(h3s)))

    print('----------')

    def parse(self, response, **kwargs):

        print('def parse')
        print(response)

        h3s = response.css('h3')

        print(str(len(h3s)))

Console output:

HtmlResponse
<200 https://sample.com/search>
0 # <- I want to show '3' here
----------
def parse
<200 https://sample.com/search>
3

Update

The program I ultimately want to write looks like the following code:

(Note: the code below does not work; it is shown for reference only.)

import scrapy
from scrapy.http import HtmlResponse


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = []
    response_url = 'https://sample.com/search'

    my_response = HtmlResponse(url=response_url)
    categories = my_response.css('.categories a::attr(href)').getall()

    for category in categories:
        start_urls.append(category)

    def parse(self, response, **kwargs):
        
        pages = response.css('h3')

        for page in pages:
            print(page.css('::text').get())

Python 3.8.5

Scrapy 2.5.0

Solution

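I understand what you mean: your start URL is the base page, but you also want to fetch every category page and extract its h3 elements. Note that HtmlResponse(url=...) does not download anything; it only creates a response object with an empty body (the status defaults to 200), so css('h3') has nothing to match.
In Scrapy you can extract data and follow new links in the same parse method. Here is an example: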

import scrapy


class SampleSpider(scrapy.Spider):

    name = 'sample'
    allowed_domains = ['sample.com']
    start_urls = ['https://sample.com/search']

    def parse(self, response, **kwargs):

        print('def parse')
        print(response)

        pages = response.css('h3')

        # Extract data here. Scrapy expects items (such as dicts) or requests, not bare strings.
        for page in pages:
            title = page.css('::text').get()
            print(title)
            yield {'title': title}

        # Follow new links here. response.follow also resolves relative URLs.
        categories = response.css('.categories a::attr(href)').getall()
        for category in categories:
            yield response.follow(category, callback=self.parse)

See the Scrapy documentation for more information.

Answered By - nay

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
