Issue
I'm scraping restaurant reviews from Yelp, specifically from this URL.
I'm trying to get the list of review containers. After testing in the Chrome console, the following XPath expression selects them:
//li/div[@class='css-1qn0b6x']
However, in the Scrapy shell, the following command returns an empty list:
response.xpath("//li/div[@class='css-1qn0b6x']").extract()
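One way to narrow this down is to check whether the class name appears at all in the raw HTML that Scrapy received, as opposed to the DOM that Chrome shows after scripts have run. The helper below is a hypothetical sketch (the name class_in_raw_html is not part of Scrapy), and it assumes the mismatch comes from the server response differing from the rendered page, which this post does not confirm.

```python
def class_in_raw_html(html: str, class_name: str) -> bool:
    """Return True if class_name appears anywhere in the raw HTML text."""
    return class_name in html


# In the Scrapy shell, the raw body is available as response.text:
#   class_in_raw_html(response.text, "css-1qn0b6x")

# Stand-alone demonstration with a made-up snippet of HTML
sample = '<li><div class="css-other">review</div></li>'
print(class_in_raw_html(sample, "css-1qn0b6x"))
```

If the check returns False, no XPath expression built on that class can match the response, and the data has to come from somewhere else.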
Solution
Continuing from the comments above, here is an example of how you can gather the reviews by first reading the yelp-biz-id
from the HTML of the page you linked and then requesting the review feed for that ID:
import requests
from bs4 import BeautifulSoup
import pandas as pd

s = requests.Session()
url = 'https://www.yelp.it/biz/roscioli-roma-4'
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}

# Fetch the business page and read the business ID from its meta tag
resp = s.get(url, headers=headers)
soup = BeautifulSoup(resp.text, 'html.parser')
biz_id = soup.find('meta', {'name': 'yelp-biz-id'})['content']

frames = []  # one DataFrame per page of reviews
for page in range(5):
    # The start offset advances by 10 for each page
    api_url = f'https://www.yelp.it/biz/{biz_id}/review_feed?start={page*10}'
    resp = s.get(api_url, headers=headers)
    if resp.status_code != 200:
        continue
    # Use a separate name so the list of DataFrames is not overwritten
    page_reviews = resp.json()['reviews']
    if len(page_reviews) > 0:
        frames.append(pd.json_normalize(page_reviews))

final_df = pd.concat(frames).reset_index()
final_df
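The paging logic above can also be separated from the HTTP calls, which makes it easier to try without hitting the site. The following is a minimal sketch rather than part of the original answer: review_feed_urls and collect_reviews are hypothetical helper names, the page size of 10 is taken from the start={page*10} pattern above, and stopping at the first empty page is an added assumption about how the feed ends.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def review_feed_urls(biz_id: str, pages: int, page_size: int = 10,
                     base: str = "https://www.yelp.it") -> Iterator[str]:
    """Yield one review_feed URL per page, following the pattern used above."""
    for page in range(pages):
        yield f"{base}/biz/{biz_id}/review_feed?start={page * page_size}"


def collect_reviews(urls: Iterable[str],
                    fetch_json: Callable[[str], Dict]) -> List[Dict]:
    """Fetch each URL and gather its reviews, stopping at the first empty page."""
    collected: List[Dict] = []
    for url in urls:
        data = fetch_json(url)
        page_reviews = data.get("reviews", [])
        if not page_reviews:
            break  # assume an empty page means there is nothing further
        collected.extend(page_reviews)
    return collected


# Offline demonstration with made-up data instead of real HTTP responses
urls = list(review_feed_urls("example-id", pages=3))
fake_pages = {
    urls[0]: {"reviews": [{"id": 1}, {"id": 2}]},
    urls[1]: {"reviews": []},
    urls[2]: {"reviews": [{"id": 3}]},
}
print(collect_reviews(urls, fake_pages.get))
```

With the session from the answer, fetch_json could be something like lambda u: s.get(u, headers=headers).json(), and the returned list can be passed to pd.json_normalize in a single call.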
Answered By - childnick