Issue
I am a newbie trying to scrape some quotes from goodreads.com, but I can't get the text = ... part to work properly.
I'm not sure what I'm missing, so I would appreciate some help.
for quote in response.css("div.quoteDetails"):
    text = quote.css("div.quoteText:not(.authorOrTitle)::text").getall()  # not getting <i>
    author = quote.css("span.authorOrTitle::text").get().strip()
    book = quote.css("a.authorOrTitle::text").get()
    tags = quote.css("div.quoteFooter div.left a::text").getall()
    print(dict(text=text, author=author, book=book, tags=tags))
I have tried some permutations, such as text = quote.css("div.quoteText :not(span):not(script) ::text").getall()
The closest I've managed is text = quote.css("div.quoteText:not(.authorOrTitle)::text").getall(), which returned the following (note that <i>smiles all the time</i> is missing):
{'text': ["\n “God does not play dice with the universe; He plays an ineffable game of His own devising, which might be compared, from the perspective of any of the other players [i.e. everybody], to being involved in an obscure and complex variant of poker in a pitch-dark room, with blank cards, for infinite stakes, with a Dealer who won't tell you the rules, and who ", '.”\n ', ' ―\n ', '\n ', '\n \n\n\n', '\n\n'],
'author': 'Terry Pratchett,',
'book': 'Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch',
'tags': ['einstein', 'gaiman', 'god', 'humor']}
HTML snippet from the page I want to scrape (https://www.goodreads.com/quotes/tag/god?page=1):
<div class="quoteDetails ">
<a class="leftAlignedImage" href="/author/show/1654.Terry_Pratchett">
<img alt="Terry Pratchett" src="https://images.gr-assets.com/authors/1235562205p2/1654.jpg">
</a>
<div class="quoteText">
“God does not play dice with the universe; He plays an ineffable game of His own devising, which might be compared, from the perspective of any of the other players [i.e. everybody], to being involved in an obscure and complex variant of poker in a pitch-dark room, with blank cards, for infinite stakes, with a Dealer who won't tell you the rules, and who <i>smiles all the time</i>.”
<br> ―
<span class="authorOrTitle">
Terry Pratchett,
</span>
<span id="quote_book_link_12067">
<a class="authorOrTitle" href="/work/quotes/4110990">Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch</a>
</span>
<script>
//<![CDATA[
var newTip = new Tip($('quote_book_link_12067'), "\n\n <h2><a class=\"readable bookTitle\" href=\"https://www.goodreads.com/book/show/12067.Good_Omens?from_choice=false&from_home_module=false\">Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch<\/a><\/h2>\n\n <div>\n by <a class=\"authorName\" href=\"/author/show/1654.Terry_Pratchett\">Terry Pratchett<\/a>\n <\/div>\n\n <div class=\"smallText uitext darkGreyText\">\n <span class=\"minirating\"><span class=\"stars staticStars notranslate\"><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p10\"><\/span><span size=\"12x12\" class=\"staticStar p3\"><\/span><\/span> 4.24 avg rating — 595,193 ratings<\/span> — published 1990\n <\/div>\n\n <div class=\"addBookTipDescription\">\n \n<span id=\"freeTextContainer8297402955929030295\">‘Armageddon only happens once, you know. They don’t let you go around again until you get it right.’\n\nPeople have been predicting the end of the world almost from its very beginning, so it’s only natural to be sceptical when a new date is set for Jud<\/span>\n <span id=\"freeText8297402955929030295\" style=\"display:none\">‘Armageddon only happens once, you know. They don’t let you go around again until you get it right.’\n\nPeople have been predicting the end of the world almost from its very beginning, so it’s only natural to be sceptical when a new date is set for Judgement Day. But what if, for once, the predictions are right, and the apocalypse really is due to arrive next Saturday, just after tea?\n\nYou could spend the time left drowning your sorrows, giving away all your possessions in preparation for the rapture, or laughing it off as (hopefully) just another hoax. 
Or you could just try to do something about it.\n\nIt’s a predicament that Aziraphale, a somewhat fussy angel, and Crowley, a fast-living demon now finds themselves in. They’ve been living amongst Earth’s mortals since The Beginning and, truth be told, have grown rather fond of the lifestyle and, in all honesty, are not actually looking forward to the coming Apocalypse.\n\nAnd then there’s the small matter that someone appears to have misplaced the Antichrist…<\/span>\n <a data-text-id=\"8297402955929030295\" href=\"#\" onclick=\"swapContent(\$(this));; return false;\">...more<\/a>\n\n <\/div>\n\n <div class=\'wtrButtonContainer wtrSignedOut\' id=\'10_book_12067\'>\n<div class=\'wtrUp wtrLeft\'>\n<form action=\"/shelf/add_to_shelf\" accept-charset=\"UTF-8\" method=\"post\"><input name=\"utf8\" type=\"hidden\" value=\"✓\" /><input type=\"hidden\" name=\"authenticity_token\" value=\"eHxnso0cZ8mac56WZu5b+f9QTJuvU6CbzkAY03nf4vcdXMimazuX3oOLs4umgVsafX94R4eGUAdsRhvwmfdIaA==\" />\n<input type=\"hidden\" name=\"book_id\" id=\"book_id\" value=\"12067\" />\n<input type=\"hidden\" name=\"name\" id=\"name\" value=\"to-read\" />\n<input type=\"hidden\" name=\"unique_id\" id=\"unique_id\" value=\"10_book_12067\" />\n<input type=\"hidden\" name=\"wtr_new\" id=\"wtr_new\" value=\"true\" />\n<input type=\"hidden\" name=\"from_choice\" id=\"from_choice\" value=\"false\" />\n<input type=\"hidden\" name=\"from_home_module\" id=\"from_home_module\" value=\"false\" />\n<input type=\"hidden\" name=\"ref\" id=\"ref\" value=\"\" class=\"wtrLeftUpRef\" />\n<input type=\"hidden\" name=\"existing_review\" id=\"existing_review\" value=\"false\" class=\"wtrExisting\" />\n<input type=\"hidden\" name=\"page_url\" id=\"page_url\" value=\"/quotes/tag/god\" />\n<button class=\'wtrToRead\' type=\'submit\'>\n<span class=\'progressTrigger\'>Want to Read<\/span>\n<span class=\'progressIndicator\'>saving…<\/span>\n<\/button>\n<\/form>\n\n<\/div>\n\n<div class=\'wtrRight wtrUp\'>\n<form 
class=\"hiddenShelfForm\" action=\"/shelf/add_to_shelf\" accept-charset=\"UTF-8\" method=\"post\"><input name=\"utf8\" type=\"hidden\" value=\"✓\" /><input type=\"hidden\" name=\"authenticity_token\" value=\"SKP2jUje5gC8x8+/fKoz/fZEAz53hITka0SMTXDRT2Etg1mZrvkWF6U/4qK8xTMedGs34l9RdHjJQo9ukPnl/g==\" />\n<input type=\"hidden\" name=\"unique_id\" id=\"unique_id\" value=\"10_book_12067\" />\n<input type=\"hidden\" name=\"book_id\" id=\"book_id\" value=\"12067\" />\n<input type=\"hidden\" name=\"a\" id=\"a\" />\n<input type=\"hidden\" name=\"name\" id=\"name\" />\n<input type=\"hidden\" name=\"from_choice\" id=\"from_choice\" value=\"false\" />\n<input type=\"hidden\" name=\"from_home_module\" id=\"from_home_module\" value=\"false\" />\n<input type=\"hidden\" name=\"page_url\" id=\"page_url\" value=\"/quotes/tag/god\" />\n<\/form>\n\n<button class=\'wtrShelfButton\'><\/button>\n<\/div>\n\n<div class=\'ratingStars wtrRating\'>\n<div class=\'starsErrorTooltip hidden\'>\nError rating book. Refresh and try again.\n<\/div>\n<div class=\'myRating uitext greyText\'>Rate this book<\/div>\n<div class=\'clearRating uitext\'>Clear rating<\/div>\n<div class=\"stars\" data-resource-id=\"12067\" data-user-id=\"0\" data-submit-url=\"/review/rate/12067?page_url=%2Fquotes%2Ftag%2Fgod&rate_books_page=false&stars_click=false&wtr_button_id=10_book_12067\" data-rating=\"0\"><a class=\"star off\" title=\"did not like it\" href=\"#\" ref=\"\">1 of 5 stars<\/a><a class=\"star off\" title=\"it was ok\" href=\"#\" ref=\"\">2 of 5 stars<\/a><a class=\"star off\" title=\"liked it\" href=\"#\" ref=\"\">3 of 5 stars<\/a><a class=\"star off\" title=\"really liked it\" href=\"#\" ref=\"\">4 of 5 stars<\/a><a class=\"star off\" title=\"it was amazing\" href=\"#\" ref=\"\">5 of 5 stars<\/a><\/div>\n<\/div>\n\n<\/div>\n\n\n\n\n", { style: 'addbook', stem: 'leftMiddle', hook: { tip: 'leftMiddle', target: 'rightMiddle' }, offset: { x: 5, y: 0 }, hideOn: false, width: 400, hideAfter: 0.05, delay: 0.35 });
$('quote_book_link_12067').observe('prototip:shown', function() {
if (this.up('#box')) {
$$('div.prototip').each(function(i){i.setStyle({zIndex: $('box').getStyle('z-index')})});
} else {
$$('div.prototip').each(function(i){i.setStyle({zIndex: 6000})});
}
});
newTip['wrapper'].addClassName('prototipAllowOverflow');
$('quote_book_link_12067').observe('prototip:shown', function () {
$$('div.prototip').each(function (e) {
if ($('quote_book_link_12067').hasClassName('ignored')) {
e.setStyle({'display': 'none'});
return;
}
e.setStyle({'overflow': 'visible'});
});
});
$('quote_book_link_12067').observe('prototip:hidden', function () {
$$('span.elementTwo').each(function (e) {
if (e.getStyle('display') !== 'none') {
var lessLink = e.next();
swapContent(lessLink);
}
});
});
//]]>
</script>
</div>
<div class="quoteFooter">
<div class="greyText smallText left">
tags:
<a href="/quotes/tag/einstein">einstein</a>,
<a href="/quotes/tag/gaiman">gaiman</a>,
<a href="/quotes/tag/god">god</a>,
<a href="/quotes/tag/humor">humor</a>
</div>
<div class="right">
<a class="smallText" title="View this quote" href="/quotes/11285-god-does-not-play-dice-with-the-universe-he-plays">2464 likes</a>
</div>
</div>
</div>
Solution
I could not figure out how to do it with a CSS selector, so I used an XPath selector instead. I then used MapCompose to strip the whitespace from each extracted fragment and Join to combine the fragments into one string.
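# spider.py snippet
from scrapy.loader import ItemLoader

def parse(self, response):
    for quote in response.css("div.quoteDetails"):
        # Bind the loader to this quote so the XPath below is evaluated relative to it.
        l = ItemLoader(GoodreadsItem(), quote)
        # Text nodes directly under div.quoteText, plus the text inside its <i> children.
        l.add_xpath("text", './/div[@class="quoteText"]/text() | .//div[@class="quoteText"]/i/text()')
        yield l.load_item()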
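# items.py snippet
import scrapy
# On older Scrapy versions these processors live in scrapy.loader.processors.
from itemloaders.processors import Join, MapCompose

class GoodreadsItem(scrapy.Item):
    text = scrapy.Field(
        # Strip leading/trailing whitespace from every extracted fragment.
        input_processor=MapCompose(str.strip),
        # Join the fragments with a single space (Join's default separator).
        output_processor=Join(),
    )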
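To see roughly what the two processors do to the extracted fragments, here is a plain-Python sketch that needs no Scrapy. The function name `clean_fragments` and the sample list are hypothetical, and the sample is a shortened version of the real output. Unlike the processors above, this sketch also drops fragments that are empty after stripping; that extra step is an assumption about the desired result, since empty strings may otherwise survive and show up as doubled spaces in the joined text.

```python
def clean_fragments(fragments, separator=" "):
    """Strip each fragment, drop the ones that end up empty, and join the rest."""
    stripped = (fragment.strip() for fragment in fragments)
    return separator.join(piece for piece in stripped if piece)


# Shortened sample of what the XPath union returns for one quote.
fragments = [
    "\n    “A Dealer who ",
    "smiles all the time",
    ".”\n    ",
    " ―\n    ",
    "\n  ",
]

text = clean_fragments(fragments)
print(text)
```

Because the pieces are joined with a space, a space appears before the closing `.”`, and the ― separator that precedes the author name is still part of the text. If that matters, a further clean-up step would be needed.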
Answered By - Niq_Lin