Sunday, December 5, 2021

[FIXED] Problem having same data while crawling a web page


Issue

I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data as output for every page.

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
               'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))

Expected: crawl 1000 pages of the website https://www.fandango.com/aquaman-208499/movie-reviews

actual output:

https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}

Solution

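The reviews are populated dynamically with JavaScript, so the page HTML that Scrapy downloads is the same whatever the pn value. In cases like this, you have to inspect the requests the site makes, for example in the Network tab of your browser's developer tools.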

The URL to get user reviews is this:

https://www.fandango.com/napi/fanReviews/208499/1/5

It returns JSON with 5 reviews.
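As a rough sketch of how that URL is put together, the helper below builds the API URL and the matching referer header for one page. The function name build_review_request and its parameters are hypothetical. Reading the path segments as movie ID / page number / reviews per page is an assumption inferred from the example URL, not from any documented Fandango API.

```python
def build_review_request(movie_id, page, per_page=5, slug="aquaman"):
    """Return (url, headers) for one page of fan reviews.

    Assumption: the path segments are movie ID / page number / page size.
    """
    url = "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(
        movie_id, page, per_page
    )
    # The referer points at the human-facing review page for the same movie.
    referer = "https://www.fandango.com/{}-{}/movie-reviews?pn={}".format(
        slug, movie_id, page
    )
    return url, {"referer": referer}


url, headers = build_review_request("208499", 1)
print(url)
print(headers["referer"])
```

Keeping the URL construction in one place makes it easier to change the movie ID or page size later without editing several format strings.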

Your spider could be rewritten like this:

import scrapy
import json
from scrapy.spiders import Spider


class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):  # pages 1 through 9
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-{movie_id}/movie-reviews?pn={page}'.format(movie_id=movie_id, page=page)}
            # Build the API URL from movie_id instead of hard-coding it a second time
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        data = json.loads(response.text)
        for review in data['data']:
            yield review

Note that I am also using yield instead of print to emit the items; this is how Scrapy expects items to be generated. You can run the spider like this to export the extracted items to a file:

scrapy crawl rate -o outputfile.json
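The parsing step can also be tried out without any network access. This is a minimal sketch, assuming the response body is a JSON object whose data key holds the list of reviews, as the spider above reads it. The function name iter_reviews is hypothetical, and the sample payload and its field names are made up for illustration; they are not the real Fandango schema.

```python
import json


def iter_reviews(body):
    """Yield each review dict from a JSON response body."""
    payload = json.loads(body)
    # Fall back to an empty list if the "data" key is missing.
    for review in payload.get("data", []):
        yield review


# Hypothetical payload; the field names are illustrative only.
sample = json.dumps({
    "data": [
        {"rating": 5, "text": "It was Awesome!"},
        {"rating": 3, "text": "Very chaotic."},
    ]
})

for item in iter_reviews(sample):
    print(item)
```

Inside the spider, parse could delegate to a helper like this by passing it response.text.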

Answered By - Luiz Rodrigues da Silva

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
