Issue
I'm fairly new to coding, and I'm supposed to parse Yelp reviews so I can analyze the data using pandas. I have been trying to use Selenium and BeautifulSoup to automate the whole process, and I got past the Chrome/webdriver issues by running it on my local machine. It technically "works" now, but no data is displayed in the output. I feel like I've tried everything;
can someone please tell me what I'm doing wrong?
I suspect it could be an issue with the HTML tag classes for this particular URL, but I'm not sure which ones to use. It also seems strange that this business page only has 47 reviews, yet there are 1384 rows in the generated CSV file.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import os
# Set the path to the ChromeDriver executable
chromedriver_path = "C:\\Users\\5mxz2\\Downloads\\chromedriver_win32\\chromedriver"
# Set the path to the Chrome binary
chrome_binary_path = "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe" # Update this with the correct path to your Chrome binary
# Set the URL of the Yelp page you want to scrape
url = "https://www.yelp.com/biz/gelati-celesti-virginia-beach-2"
# Set the options for Chrome
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless") # Run Chrome in headless mode, comment this line if you want to see the browser window
chrome_options.binary_location = chrome_binary_path
# Create the ChromeDriver service
service = Service(chromedriver_path)
# Create the ChromeDriver instance
driver = webdriver.Chrome(service=service, options=chrome_options)
# Load the Yelp page
driver.get(url)
# Wait for the reviews to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".border-color--default__09f24__NPAKY")))
# Extract the page source and pass it to BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
# Find all review elements on the page
reviews = soup.find_all("div", class_="border-color--default__09f24__NPAKY")
# Create empty lists to store the extracted data
review_texts = []
ratings = []
dates = []
# Iterate over each review element
for review in reviews:
    # Extract the review text
    review_text_element = review.find("div", class_="margin-b2__09f24__CEMjT.border-color--default__09f24__NPAKY")
    review_text = review_text_element.get_text() if review_text_element else ""
    review_texts.append(review_text.strip())
    # Extract the rating
    rating_element = review.find("div", class_="five-stars__09f24__mBKym.five-stars--regular__09f24__DgBNj.display--inline-block__09f24__fEDiJ.border-color--default__09f24__NPAKY")
    rating = rating_element.get("aria-label") if rating_element else ""
    ratings.append(rating)
    # Extract the date
    date_element = review.find("span", class_="css-chan6m")
    date = date_element.get_text() if date_element else ""
    dates.append(date.strip())
# Create a DataFrame from the extracted data
data = {
    "Review Text": review_texts,
    "Rating": ratings,
    "Date": dates
}
df = pd.DataFrame(data)
# Print the DataFrame
print(df)
# Get the current working directory
path = os.getcwd()
# Save the DataFrame as a CSV file
csv_path = os.path.join(path, "yelp_reviews.csv")
df.to_csv(csv_path, index=False)
# Close the ChromeDriver instance
driver.quit()
I also just noticed that some information was printed in the Date column of the CSV file, but the values seem randomly placed and not all of them are actually dates.
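For what it's worth, the row-count mismatch is consistent with the selector over-matching: `border-color--default__09f24__NPAKY` is a generated utility class that Yelp applies to many layout divs, not just review cards, so `find_all` returns far more elements than there are reviews, and most of them yield empty strings. As a stopgap, rows with no review text can be dropped before saving. A minimal sketch using pandas (the small DataFrame here is a stand-in for the one built by the script above):

```python
import pandas as pd

# Stand-in for the scraped DataFrame: most matched divs produce empty strings,
# and some pick up stray text (e.g. a location) instead of a date
df = pd.DataFrame({
    "Review Text": ["Great gelato!", "", "", "Worth the trip.", ""],
    "Rating": ["5 star rating", "", "", "4 star rating", ""],
    "Date": ["5/1/2023", "", "Virginia Beach, VA", "4/2/2023", ""],
})

# Keep only rows where a review text was actually extracted
cleaned = df[df["Review Text"].str.strip() != ""].reset_index(drop=True)
print(len(cleaned))  # 2 rows survive out of 5
```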
Solution
I have rewritten the code to do the same thing using requests, since Selenium adds unnecessary overhead here.
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
restaurant_url = 'https://www.yelp.com/biz/gelati-celesti-virginia-beach-2'
headers = {
    'host': 'www.yelp.com'
}
restaurant_page = bs(requests.get(restaurant_url, headers=headers).text, 'lxml')
biz_id = restaurant_page.find('meta', {'name': 'yelp-biz-id'}).get('content')
review_count = int(restaurant_page.find('a', {'href': '#reviews'}).text.split(' ')[0])
data = []
for review_page in range(0, review_count, 10):  # 10 reviews per page
    review_api_url = f'https://www.yelp.com/biz/{biz_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={review_page}'
    for review in requests.get(review_api_url, headers=headers).json()['reviews']:
        data.append({
            'Review Text': review['comment']['text'],
            'Rating': review['rating'],
            'Date': review['localizedDate']
        })
        print(data[-1])
pd.DataFrame(data).to_csv('Yelp Review.csv', index=None)
In this code, I get the business id (yelp-biz-id) and the total number of reviews from the restaurant page, then use them with Yelp's review-feed endpoint to fetch all the reviews ten at a time, saving them to a CSV at the end.
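Since the original goal was to analyze the data with pandas, here is a minimal sketch of loading the saved CSV back and computing a couple of summary statistics. The column names match the script above; the small in-memory DataFrame and the date format are illustrative assumptions (in practice you would start from `pd.read_csv('Yelp Review.csv')`):

```python
import pandas as pd

# Stand-in for pd.read_csv('Yelp Review.csv'), shaped like the saved file
reviews = pd.DataFrame({
    "Review Text": ["Great gelato!", "Worth the trip.", "Too sweet for me."],
    "Rating": [5, 4, 3],
    "Date": ["5/1/2023", "4/2/2023", "3/15/2023"],
})

# Parse the date column, then compute basic statistics
reviews["Date"] = pd.to_datetime(reviews["Date"], format="%m/%d/%Y")
print(reviews["Rating"].mean())          # 4.0
print(reviews["Rating"].value_counts())  # distribution of star ratings
print(reviews["Date"].min(), reviews["Date"].max())
```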
Answered By - Zero