Issue
I am trying to crawl a web page to get its reviews and ratings, but I am getting the same data as the output for every page.
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        for i in range(1, 10):
            url = "https://www.fandango.com/aquaman-208499/movie-reviews?pn=" + str(i)
            print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        print(json.dumps({'rating': response.xpath("//div[@class='star-rating__score']").xpath("@style").extract(),
                          'review': response.xpath("//p[@class='fan-reviews__item-content']/text()").getall()}))
Expected: crawl 1000 pages of the site https://www.fandango.com/aquaman-208499/movie-reviews
Actual output:
https://mobile.fandango.com/aquaman-208498/movie-reviews?pn=1
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
https://mobile.fandango.com/aquaman-208499/movie-reviews?pn=9
{"rating": ["width: 90%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 100%;", "width: 60%;"], "review": ["Everything and more that you would expect from Aquaman. Lots of action, humor, interpersonal conflict, and some romance.", "Best Movie ever action great story omg DC has stepped its game up excited for the next movie \n\nTotal must see total", "It was Awesome! Visually Stunning!", "It was fantastic five stars", "Very chaotic with too much action and confusion."]}
Solution
The reviews are populated dynamically using JavaScript, so they are not in the HTML that Scrapy downloads. In cases like this you have to inspect the requests made by the site.
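The URL to get user reviews follows the pattern requested in the spider below: https://www.fandango.com/napi/fanReviews/208499/{page}/5, where {page} is the page number.
It returns JSON containing 5 reviews.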
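As a rough sketch of how each request is put together, the helper below builds the API URL and the referer header for a given page. The function name build_review_request and its parameters are made up for illustration, and treating the last path segment as the number of reviews per page is an assumption based on the endpoint returning 5 reviews.

```python
def build_review_request(movie_slug, movie_id, page, page_size=5):
    """Return (url, headers) for one page of fan reviews.

    Hypothetical helper: the URL layout is copied from the spider below;
    reading the last segment as a page size is an assumption.
    """
    url = "https://www.fandango.com/napi/fanReviews/{}/{}/{}".format(
        movie_id, page, page_size
    )
    # The referer mimics the review page a browser would be on.
    referer = "https://www.fandango.com/{}/movie-reviews?pn={}".format(
        movie_slug, page
    )
    return url, {"referer": referer}


url, headers = build_review_request("aquaman-208499", "208499", 3)
print(url)
print(headers["referer"])
```

Keeping the URL and the referer in one place makes it harder for the two page numbers to drift apart.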
Your spider could be rewritten like this:
import scrapy
import json
from scrapy.spiders import Spider

class RatingSpider(Spider):
    name = "rate"

    def start_requests(self):
        movie_id = "208499"
        for page in range(1, 10):
            # You have to pass the referer, otherwise the site returns a 403 error
            headers = {'referer': 'https://www.fandango.com/aquaman-{movie_id}/movie-reviews?pn={page}'.format(movie_id=movie_id, page=page)}
            url = "https://www.fandango.com/napi/fanReviews/{movie_id}/{page}/5".format(movie_id=movie_id, page=page)
            yield scrapy.Request(url=url, callback=self.parse, headers=headers)

    def parse(self, response):
        # The endpoint returns JSON; each entry under 'data' is one review
        data = json.loads(response.text)
        for review in data['data']:
            yield review
Note that I am also using yield instead of print to extract the items; this is how Scrapy expects items to be generated. You can run the spider like this to export the extracted items to a file:
scrapy crawl rate -o outputfile.json
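To illustrate why yield matters, here is a minimal sketch in plain Python, without Scrapy. A generator hands its items to whoever consumes it, which is what lets Scrapy pass them on to the exporter, whereas print only writes to the console. The name parse_reviews and the sample body are invented for the example; only the top-level 'data' key comes from the spider above.

```python
import json


def parse_reviews(body):
    """Yield each review from a JSON body shaped like {"data": [...]}."""
    payload = json.loads(body)
    for review in payload["data"]:
        yield review


# Made-up stand-in for a response body; real reviews carry other fields.
sample_body = json.dumps({"data": [{"id": 1}, {"id": 2}]})

# The caller (Scrapy, in the real spider) consumes the generator and
# receives every item, so it can write them to a file.
items = list(parse_reviews(sample_body))
print(json.dumps(items))
```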
Answered By - Luiz Rodrigues da Silva