Issue
I want to use Scrapy to extract the titles of different books in a url and output/store them as an array of dictionaries in a json file.
Here is my code:
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
star_urls = [
"http://books.toscrape.com"
]
def parse(self, response):
titles = response.css("article.product_pod h3 a::attr(title)").getall()
for title in titles:
yield {"title": title}
Here is what i put in the terminal:
scrapy crawl books -o books.json
The books.json file is created but is empty.
I checked that I was in the right directory and venv but it still doesn't work.
However:
Earlier, i deployed this spider to scrape the whole html data and write it to a books.html file and everything worked.
Here is my code for this:
import scrapy
class BooksSpider(scrapy.Spider):
name = "books"
star_urls = [
"http://books.toscrape.com"
]
def parse(self, response):
with open("books.html", "wb") as file:
file.write(response.body)
and here is what I put in my terminal:
scrapy crawl books
Any ideas on what I'm doing wrong? Thanks
Edit:
inputing response.css('article.product_pod h3 a::attr(title)').getall()
into the scrapy shell outputs:
['A Light in the Attic', 'Tipping the Velvet', 'Soumission', 'Sharp Objects', 'Sapiens: A Brief History of Humankind', 'The Requiem Red', 'The Dirty Little Secrets of Getting Your Dream Job', 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'The Black Maria', 'Starving Hearts (Triangular Trade Trilogy, #1)', "Shakespeare's Sonnets", 'Set Me Free', "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'Rip it Up and Start Again', 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'Olio', 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'Libertarianism for Beginners', "It's Only the Himalayas"]
Solution
Now run the code.It should work
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = ['http://books.toscrape.com/']
def parse(self, response):
titles = response.css('.product_pod')
for title in titles:
yield {
"title": title.css('h3 a::attr(title)').get()
#"title": title.css('h3 a::text').get()
}
Output:
[
{
"title": "A Light in the Attic"
},
{
"title": "Tipping the Velvet"
},
{
"title": "Soumission"
},
{
"title": "Sharp Objects"
},
{
"title": "Sapiens: A Brief History of Humankind"
},
{
"title": "The Requiem Red"
},
{
"title": "The Dirty Little Secrets of Getting Your Dream Job"
},
{
"title": "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
},
{
"title": "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
},
{
"title": "The Black Maria"
},
{
"title": "Starving Hearts (Triangular Trade Trilogy, #1)"
},
{
"title": "Shakespeare's Sonnets"
},
{
"title": "Set Me Free"
},
{
"title": "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
},
{
"title": "Rip it Up and Start Again"
},
{
"title": "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
},
{
"title": "Olio"
},
{
"title": "Mesaerion: The Best Science Fiction Stories 1800-1849"
},
{
"title": "Libertarianism for Beginners"
},
{
"title": "It's Only the Himalayas"
}
]
Answered By - Fazlul
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.