Issue
I'm scraping restaurant reviews from Yelp by accessing each restaurant's review API. I'm currently scraping 4-star reviews; for example, this restaurant page has this corresponding API.
This is the block of code that sends an HTTP request to the API while the crawler is on the restaurant page:
# Read the business id from the page, build the review-feed API URL and request it
bizId = response.xpath("//meta[@name='yelp-biz-id']/@content").extract_first()
api_url = 'https://www.yelp.it/biz/' + bizId + '/review_feed?rr=' + str(n_star_filter)
yield response.follow(url=api_url, callback=self.parse_yelp_restaurant_api)
Sometimes the API is accessed correctly and I'm able to scrape it. However, most of the time, I get this error:
2023-10-27 15:57:39 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.yelp.it/biz/78t73jTxdUw5C-v44lj4Iw/review_feed?rr=4>
Traceback (most recent call last):
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 64, in process_response
method(request=request, response=response, spider=spider)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 63, in process_response
decoded_body = self._decode(response.body, encoding.lower())
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 102, in _decode
body = brotli.decompress(body)
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 90, in decompress
d.finish()
File "/Users/mauri/anaconda3/lib/python3.11/site-packages/brotli/brotli.py", line 464, in finish
raise Error("Decompression error: incomplete compressed stream.")
brotli.brotli.Error: Decompression error: incomplete compressed stream.
I can't figure out what this means, and it's strange that some API responses download fine while others produce this error, even though the requests look identical.
Solution
This is likely a violation of Yelp's policy; such websites don't like it when people scrape their data in this fashion. For example, this policy says:
Use any robot, spider, Service search/retrieval application, or other automated device, process or means to access, retrieve, copy, scrape, or index any portion of the Service or any Service Content, except as expressly permitted by Yelp (for example, as described at www.yelp.com/robots.txt);
Based on the code and the behaviour, it's likely that the server detects the automated scraping and cuts the response off partway through. The brotli "incomplete compressed stream" error is just a symptom of that truncated body, not a compression problem on your side. You may want to look into official Yelp API access via https://www.yelp.com/developers.
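If you still want to confirm the diagnosis or make the crawl less aggressive, the following is a minimal sketch, not a guaranteed workaround. The spider name, the on_api_error errback, the settings values and the assumption that n_star_filter is 4 are all illustrative additions to the code quoted in the question; if Yelp deliberately cuts the connection, throttling may or may not help.

import scrapy


class YelpReviewSpider(scrapy.Spider):
    name = "yelp_reviews"
    # start_urls / start_requests for the restaurant pages omitted

    # Illustrative settings: throttle the crawl and avoid brotli-encoded responses.
    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        "AUTOTHROTTLE_ENABLED": True,
        # Ask for gzip/deflate instead of brotli; this may make truncated
        # responses easier to inspect than the brotli failure above.
        "DEFAULT_REQUEST_HEADERS": {"Accept-Encoding": "gzip, deflate"},
    }

    def parse(self, response):
        n_star_filter = 4  # assumption: same star filter as in the question
        biz_id = response.xpath("//meta[@name='yelp-biz-id']/@content").get()
        api_url = f"https://www.yelp.it/biz/{biz_id}/review_feed?rr={n_star_filter}"
        yield response.follow(
            api_url,
            callback=self.parse_yelp_restaurant_api,
            errback=self.on_api_error,
        )

    def parse_yelp_restaurant_api(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)

    def on_api_error(self, failure):
        # Called when the download or a downloader middleware fails,
        # including the brotli error shown in the traceback.
        self.logger.warning("API request failed: %s (%r)",
                            failure.request.url, failure.value)

The errback at least tells you per-URL which API requests the server refuses to serve in full, which makes it easier to see whether the failures correlate with request rate.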
Answered By - oleksii