Saturday, October 15, 2022

[FIXED] scrapy returning an empty object

Labels: css-selectors, python, scrapy, web-crawler

Issue

I am using a CSS selector and keep getting a response with empty values. Here is the code.

import scrapy


class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")  # gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()  # gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title': title,
            'price': price,
            'details': details
        }

Solution

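In your code, response.css("div.tribe-events-content") returns a list of selectors, one per matching element. Calling .css(...).extract() on that whole list does not give you one title, price and description per event; it gives one flat list per field for the entire page. Two of the selectors also have a stray leading dot: .span.ticket-cost and .p match elements whose class is span or p, not <span> and <p> tags, so they return nothing.

This is why you're not getting the data you want. Use a for loop over all_div_activities and select the title, price and details from each event inside the loop.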

Code for Script

def parse(self, response):
    # Each matching div holds a single event.
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()

        # Some events have no price element, so fall back to a placeholder.
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'

        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()

        yield {
            # get() returns None when nothing matches, so guard before strip().
            'title': title.strip() if title else title,
            'price': price,
            'details': details,
        }

Notes

The if statement for price is there because some events had no price at all, so a placeholder value is yielded instead.
strip() is called on title when yielding the dictionary because the title came with surrounding spaces and \n characters.

Advice

As a minor point, the Scrapy docs suggest using the get() and getall() methods rather than extract_first() and extract(). With extract() it's not always obvious whether the output will be a list or a string: called on a list of selectors it returns a list (which is what I got here), while called on a single selector it returns a string. get() always returns a single result, either a string or None if nothing matched, and getall() always returns a list. It's also a bit more compact. Because get() gave me a string, I could strip the newlines and spaces from the title, as you can see in the above code.

Another tip: if the class attribute is quite long, use a *= (substring) attribute selector, as long as the partial value you pick still uniquely identifies the data you want.

Using items instead of yielding a dictionary may be better in the long run, as you can set default values for data that is present for some events on the page you're scraping and missing for others. You have to do this through a pipeline (if you don't understand this yet, don't worry). See the Scrapy docs on items for more.

Answered By - AaronS

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
