Issue
I am using CSS selectors and keep getting a response with empty values. Here is the code:
import scrapy


class WebSpider(scrapy.Spider):
    name = 'activities'
    start_urls = [
        'http://capetown.travel/events/'
    ]

    def parse(self, response):
        all_div_activities = response.css("div.tribe-events-content")#gdlr-core-pbf-column gdlr-core-column-60 gdlr-core-column-first
        title = all_div_activities.css("h2.tribe-events-list-event-title::text").extract()#gdlr-core-text-box-item-content
        price = all_div_activities.css(".span.ticket-cost::text").extract()
        details = all_div_activities.css(".p::text").extract()
        yield {
            'title':title,
            'price':price,
            'details':details
        }
Solution
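In your code, all_div_activities is a SelectorList containing every event on the page. Calling .css(...).extract() on that whole list returns one flat list of matches across all events, so the values are not grouped per event, and a selector that matches nothing simply gives you an empty list. Also note that .span.ticket-cost and .p are class selectors (they look for class="span" and class="p"), not element selectors; to match the elements themselves, drop the leading dot, as in span.ticket-cost and p.
This is why you're not getting the data you want. To get one item per event, use a for loop over all_div_activities and select the title, price, and details from each event in turn.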
Code for Script
def parse(self, response):
    # Each matched div is one event, so fields are selected relative to it.
    all_div_activities = response.css('div.tribe-events-event-content')
    for a in all_div_activities:
        title = a.css('a.tribe-event-url::text').get()
        # Some events have no price element at all.
        if a.css('span.ticket-cost::text'):
            price = a.css('span.ticket-cost::text').get()
        else:
            price = 'No price'
        details = a.css('div[class*="tribe-events-list-event-description"] > p::text').get()
        yield {
            # get() returns None when nothing matches, so guard before strip().
            'title': title.strip() if title else title,
            'price': price,
            'details': details
        }
Notes
- An if statement is used for price because some events have no price at all, so it is a good idea to fill in a placeholder value.
- strip() is called on title when yielding the dictionary because the title comes with surrounding spaces and \n characters.
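Both notes can be folded into one small helper. This is only a sketch: clean_text is a hypothetical name, not part of Scrapy, and it assumes the extracted value is either a string or None.

```python
from typing import Optional


def clean_text(value: Optional[str], default: Optional[str] = None) -> Optional[str]:
    """Strip surrounding whitespace; fall back to `default` when nothing usable was extracted."""
    if value is None:
        return default
    stripped = value.strip()
    # Treat whitespace-only text the same as a missing value.
    return stripped if stripped else default
```

For example, clean_text(a.css('span.ticket-cost::text').get(), 'No price') would cover the missing-price case without a separate if statement.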
Advice
As a minor point, the Scrapy docs suggest using the get() and getall() methods rather than extract_first() and extract(). With extract() it is not always obvious whether the output will be a list or not; in this case the output I got was a list. get() is also a bit more compact: it always returns a single result, either a string or None if nothing matched, while getall() always returns a list. This is also why I could strip the newlines and spaces from the title in the code above.
Another tip: if the class attribute is quite long, use a *= (substring) attribute selector, as long as the partial value you pick still uniquely identifies the data you want.
Using items instead of yielding a dictionary may be better in the long run, as you can set default values for fields that are present for some events on the page you're scraping and missing for others. This is typically done through an item pipeline (again, if you don't understand this yet, don't worry). See the Scrapy docs on items for a bit more detail.
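For reference, a pipeline that fills in defaults could look roughly like the sketch below. DefaultValuesPipeline and its DEFAULTS mapping are hypothetical names chosen for this example, and it assumes the spider yields plain dictionaries as in the code above.

```python
class DefaultValuesPipeline:
    """Hypothetical pipeline that fills in missing or empty fields."""

    # Assumed defaults; adjust the field names to match your own items.
    DEFAULTS = {"price": "No price", "details": ""}

    def process_item(self, item, spider):
        # Scrapy calls process_item for every item the spider yields.
        for field, default in self.DEFAULTS.items():
            if item.get(field) is None:
                item[field] = default
        return item


# Standalone check, outside of a running crawl.
result = DefaultValuesPipeline().process_item(
    {"title": "Jazz Night", "price": None}, spider=None
)
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, for example {"myproject.pipelines.DefaultValuesPipeline": 300}, where the module path is an assumption about your project layout.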
Answered By - AaronS