Sunday, December 5, 2021

[FIXED] Scrapy spider finding one "Next" button but not the other

Labels: python-3.x, scrapy

Issue

I am writing a spider to scrape a popular reviews website :-) This is my first attempt at writing a Scrapy spider.

The top level is a list of restaurants, which appear 30 at a time. My spider accesses each link and then "clicks next" to get the next 30, and so on. This part is working: my output contains thousands of restaurants, not just the first 30.

I then want it to "click" on the link to each restaurant page ("restaurant level"), but this contains only truncated versions of the reviews, so I want it to then "click" down a further level (to "review level") and scrape the reviews from there, which appear 5 at a time with another "next" button. This is the only level from which I am extracting anything. The other levels just provide the links I need to follow to reach the reviews and the other info I want.

Most of this is working as I am getting all the information I want, but only for the first 5 reviews per restaurant. It is not "finding" the "next" button on the bottom "review level".

I have tried changing the order of commands within the parse method, but other than that I am running short of ideas! My xpaths are fine, so it must be something to do with the structure of the spider.

My spider looks thus:

import scrapy
from scrapy.http import Request

class TripSpider(scrapy.Spider):

    name = 'tripadvisor'
    allowed_domains = ['tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g187069-Manchester_Greater_Manchester_England.html']
    custom_settings = {
       'DOWNLOAD_DELAY': 1,
       # 'DEPTH_LIMIT': 3,
       'AUTOTHROTTLE_TARGET_CONCURRENCY': 0.5,
       'USER_AGENT': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
       # 'DEPTH_PRIORITY': 1,
       # 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
       # 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def scrape_review(self, response):
        restaurant_name_review = response.xpath('//div[@class="wrap"]//span[@class="taLnk "]//text()').extract()
        reviewer_name = response.xpath('//div[@class="username mo"]//text()').extract()
        review_rating = response.xpath('//div[@class="wrap"]/div[@class="rating reviewItemInline"]/span[starts-with(@class,"ui_bubble_rating")]').extract()
        review_title = response.xpath('//div[@class="wrap"]//span[@class="noQuotes"]//text()').extract()
        full_reviews = response.xpath('//div[@class="wrap"]/div[@class="prw_rup prw_reviews_text_summary_hsx"]/div[@class="entry"]/p').extract()
        review_date = response.xpath('//div[@class="prw_rup prw_reviews_stay_date_hsx"]/text()[not(parent::script)]').extract()
        restaurant_name = response.xpath('//div[@id="listing_main_sur"]//a[@class="HEADING"]//text()').extract() * len(full_reviews)
        restaurant_rating = response.xpath('//div[@class="userRating"]//@alt').extract() * len(full_reviews)
        restaurant_review_count = response.xpath('//div[@class="userRating"]//a//text()').extract() * len(full_reviews)

        for rvn, rvr, rvt, fr, rd, rn, rr, rvc in zip(reviewer_name, review_rating, review_title, full_reviews, review_date, restaurant_name, restaurant_rating, restaurant_review_count):
            reviews_dict = dict(zip(['reviewer_name', 'review_rating', 'review_title', 'full_reviews', 'review_date', 'restaurant_name', 'restaurant_rating', 'restaurant_review_count'], (rvn, rvr, rvt, fr, rd, rn, rr, rvc)))
            yield reviews_dict
            # print(reviews_dict)

    def parse(self, response):
        ### The parse method is what is actually being repeated / iterated
        for review in self.scrape_review(response):
            yield review
            # print(review)

        # access next page of restaurants
        next_page_restaurants = response.xpath('//a[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract_first()
        next_page_restaurants_url = response.urljoin(next_page_restaurants)
        yield Request(next_page_restaurants_url)
        print(next_page_restaurants_url)

        # access next page of reviews
        next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()
        next_page_reviews_url = response.urljoin(next_page_reviews)
        yield Request(next_page_reviews_url)
        print(next_page_reviews_url)

        # access each restaurant page:
        url = response.xpath('//div[@id="EATERY_SEARCH_RESULTS"]/div/div/div/div/a[@target="_blank"]/@href').extract()
        for url_next in url:
            url_full = response.urljoin(url_next)
            yield Request(url_full)

        # "accesses the first review to get to the full reviews (not the truncated versions)"
        first_review = response.xpath('//a[@class="title "]/@href').extract_first() # extract first used as I only want to access one of the links on this page to get down to "review level"
        first_review_full = response.urljoin(first_review)
        yield Request(first_review_full)
        # print(first_review_full)

Solution

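You are missing a space at the end of the class value. An XPath test such as @class="..." compares the whole attribute string exactly, so a trailing space in the page's markup has to appear in the expression too.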

Try this:

next_page_reviews = response.xpath('//a[@class="nav next taLnk "]/@href').extract_first()

Here are some tips on matching classes partially: https://docs.scrapy.org/en/latest/topics/selectors.html#when-querying-by-class-consider-using-css

On a side note, you can define separate parse functions to make it clearer what each one is responsible for: https://docs.scrapy.org/en/latest/intro/tutorial.html?highlight=callback#more-examples-and-patterns

Answered By - malberts

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
