Issue
I am writing a spider to scrape a popular reviews website :-) This is my first attempt at writing a Scrapy spider.
The top level is a list of restaurants (I call this "top level"), which appear 30 at a time. My spider accesses each link and then "clicks next" to get the next 30, and so on. This part is working as my output does contain thousands of restaurants, not just the first 30.
I then want it to "click" on the link to each restaurant page ("restaurant level"), but this contains only truncated versions of the reviews, so I want it to then "click" down a further level (to "review level") and scrape the reviews from there, which appear 5 at a time with another "next" button. This is the only "level" from which I am extracting anything - the other levels just have links to access to get to the reviews and other info I want.
Most of this is working as I am getting all the information I want, but only for the first 5 reviews per restaurant. It is not "finding" the "next" button on the bottom "review level".
I have tried changing the order of commands within the parse method, but other than that I am coming up short of ideas! My xpaths are fine so it must be something to do with structure of the spider.
My spider looks thus:
import scrapy
from scrapy.http import Request
class TripSpider(scrapy.Spider):
name = 'tripadvisor'
allowed_domains = ['tripadvisor.co.uk']
start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g187069-Manchester_Greater_Manchester_England.html']
custom_settings = {
'DOWNLOAD_DELAY': 1,
# 'DEPTH_LIMIT': 3,
'AUTOTHROTTLE_TARGET_CONCURRENCY': 0.5,
'USER_AGENT': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
# 'DEPTH_PRIORITY': 1,
# 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
# 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
}
def scrape_review(self, response):
restaurant_name_review = response.xpath('//div[@class="wrap"]//span[@class="taLnk "]//text()').extract()
reviewer_name = response.xpath('//div[@class="username mo"]//text()').extract()
review_rating = response.xpath('//div[@class="wrap"]/div[@class="rating reviewItemInline"]/span[starts-with(@class,"ui_bubble_rating")]').extract()
review_title = response.xpath('//div[@class="wrap"]//span[@class="noQuotes"]//text()').extract()
full_reviews = response.xpath('//div[@class="wrap"]/div[@class="prw_rup prw_reviews_text_summary_hsx"]/div[@class="entry"]/p').extract()
review_date = response.xpath('//div[@class="prw_rup prw_reviews_stay_date_hsx"]/text()[not(parent::script)]').extract()
restaurant_name = response.xpath('//div[@id="listing_main_sur"]//a[@class="HEADING"]//text()').extract() * len(full_reviews)
restaurant_rating = response.xpath('//div[@class="userRating"]//@alt').extract() * len(full_reviews)
restaurant_review_count = response.xpath('//div[@class="userRating"]//a//text()').extract() * len(full_reviews)
for rvn, rvr, rvt, fr, rd, rn, rr, rvc in zip(reviewer_name, review_rating, review_title, full_reviews, review_date, restaurant_name, restaurant_rating, restaurant_review_count):
reviews_dict = dict(zip(['reviewer_name', 'review_rating', 'review_title', 'full_reviews', 'review_date', 'restaurant_name', 'restaurant_rating', 'restaurant_review_count'], (rvn, rvr, rvt, fr, rd, rn, rr, rvc)))
yield reviews_dict
# print(reviews_dict)
def parse(self, response):
### The parse method is what is actually being repeated / iterated
for review in self.scrape_review(response):
yield review
# print(review)
# access next page of resturants
next_page_restaurants = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
next_page_restaurants_url = response.urljoin(next_page_restaurants)
yield Request(next_page_restaurants_url)
print(next_page_restaurants_url)
# access next page of reviews
next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
next_page_reviews_url = response.urljoin(next_page_reviews)
yield Request(next_page_reviews_url)
print(next_page_reviews_url)
# access each restaurant page:
url = response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract()
for url_next in url:
url_full = response.urljoin(url_next)
yield Request(url_full)
# "accesses the first review to get to the full reviews (not the truncated versions)"
first_review = response.xpath('//a[@class="title "]/@href').extract_first() # extract first used as I only want to access one of the links on this page to get down to "review level"
first_review_full = response.urljoin(first_review)
yield Request(first_review_full)
# print(first_review_full)
Solution
You are missing a space at the end of the class value:
Try this:
next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
Here are some tips on matching classes partially: https://docs.scrapy.org/en/latest/topics/selectors.html#when-querying-by-class-consider-using-css
On a side note, you can define separate parse functions to make it clearer what each one is responsible for: https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=callback#more-examples-and-patterns
Answered By - malberts
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.