Friday, December 1, 2023

[FIXED] Issue with python script for scraping Trustpilot reviews

Labels: beautifulsoup, csv, python, web-scraping

Issue

I am new to Python and coding in general, but I am trying to create a script to pull customer reviews from Trustpilot. I think I have something that works, and I tested it in Google Bard. Bard returns results, but when I run the same script on my Mac in PyCharm CE, it creates a .csv file with the right headers but no data.

I am sure I am missing something obvious. Why can Google Bard run the script and return results, while on my machine I get just the headers in the CSV file?

Any help would be much appreciated. I am getting no errors when I run it locally. I have Python 3.12 installed and all the required modules.

Thanks.....Justin

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import csv
import datetime

# Create a new Selenium webdriver instance
driver = webdriver.Chrome()

# Navigate to the given page
driver.get("https://uk.trustpilot.com/review/www.whsmith.co.uk")

# Wait for the page to load
driver.implicitly_wait(10)

# Get the HTML source code of the page
html = driver.page_source

# Create a BeautifulSoup object from the HTML source code
soup = BeautifulSoup(html, "html.parser")

# Extract all of the reviews from the page
reviews = soup.findAll("div", class_="review")

# Create a new CSV file to store the reviews
with open("whsmith_reviews.csv", "w", newline="") as f:
    writer = csv.writer(f)

    # Write the header row
    writer.writerow(["Review Title", "Review Text", "Rating", "Review Date"])

    # Iterate over the reviews and write them to the CSV file
    for review in reviews:
        title = review.find("h2", class_="review-title").text
        text = review.find("p", class_="review-text").text
        rating = review.find("span", class_="review-rating").text
        date_str = review.find("span", class_="review-date").text
        date = datetime.datetime.strptime(date_str, "%d %b %Y")

        # Add the review to the CSV file
        writer.writerow([title, text, rating, date])

# Close the Selenium webdriver instance
driver.quit()

Solution

The main issue is your selector for the reviews: there is no div with class review on the page, so the result is empty. Focus on the article elements instead:

soup.select('article')
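This also explains why the script ran without errors: when nothing matches, the result is simply empty, so the loop body never executes and only the header row reaches the file. A minimal sketch of that behaviour using only the standard library, with an empty list standing in for the unmatched result and an in-memory buffer standing in for the .csv file (both are assumptions made for illustration):

```python
import csv
import io

# Stand-in for soup.findAll("div", class_="review") when nothing matches:
# the result is empty, and no exception is raised.
reviews = []

buffer = io.StringIO()  # in-memory file, used here instead of a real .csv
writer = csv.writer(buffer)
writer.writerow(["Review Title", "Review Text", "Rating", "Review Date"])

for review in reviews:
    # Never reached: there is nothing to iterate over.
    writer.writerow([review])

if not reviews:
    # A check like this would have surfaced the problem immediately.
    print("No reviews matched the selector; only the header was written.")

print(buffer.getvalue())
```

Adding a check like the `if not reviews` line to a scraper is a cheap way to notice that a selector has stopped matching.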

In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
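To make the difference concrete, here is a small sketch that runs the selectors against an inline HTML string rather than the live site. The markup and the data-text attribute are hypothetical, chosen only to mimic a page that wraps each review in an article element; they are not Trustpilot's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each review sits in an <article>, none in <div class="review">.
html = """
<article><h2>Great shop</h2><p data-text>Fast delivery.</p></article>
<article><h2>Poor service</h2></article>
"""

soup = BeautifulSoup(html, "html.parser")

old_style = soup.find_all("div", class_="review")  # matches nothing in this markup
articles = soup.find_all("article")                # snake_case replacement for findAll()
via_css = soup.select("article")                   # same elements via a CSS selector

print(len(old_style), len(articles), len(via_css))

for article in via_css:
    title = article.h2.get_text(strip=True)
    body = article.select_one("[data-text]")       # None when the review has no body
    print(title, "|", body.get_text(strip=True) if body else None)
```

Note that select_one() returns None when nothing matches, which is why the body text is guarded before calling get_text().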

There is also no need for Selenium in this case; requests is enough. Just take a look:

from bs4 import BeautifulSoup
import requests, csv


data = []

from_page = 1
to_page = 5

for i in range(from_page, to_page + 1):
    # Request page i; assumes Trustpilot paginates with a ?page=N query parameter
    response = requests.get(f"https://uk.trustpilot.com/review/www.whsmith.co.uk?page={i}")
    web_page = response.text
    soup = BeautifulSoup(web_page, "html.parser")

    for e in soup.select('article'):
        data.append({
            'review_title':e.h2.text,
            'review_date_original': e.select_one('[data-service-review-date-of-experience-typography]').text.split(': ')[-1],
            'review_rating':e.select_one('[data-service-review-rating] img').get('alt'),
            'review_text': e.select_one('[data-service-review-text-typography]').text if e.select_one('[data-service-review-text-typography]') else None,
            'page_number':i
        })



with open('zzz_my_result.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, data[0].keys())
    dict_writer.writeheader()
    dict_writer.writerows(data)
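One caveat with the writing step: data[0].keys() raises an IndexError if no reviews were collected. A possible variant, sketched here under the assumption that the column names are known up front, passes the field names explicitly so that an empty result still produces a valid header-only file. The names write_reviews and FIELDNAMES are hypothetical, and the sample row is made-up data.

```python
import csv
import io

# Column names matching the dictionary keys used when collecting the reviews.
FIELDNAMES = [
    "review_title",
    "review_date_original",
    "review_rating",
    "review_text",
    "page_number",
]


def write_reviews(rows, output_file, fieldnames=FIELDNAMES):
    """Write rows to an open text file and return the number of data rows."""
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Empty input: header only, no IndexError.
empty = io.StringIO()
count_empty = write_reviews([], empty)

# One made-up row; a missing review text (None) is written as an empty cell.
sample = io.StringIO()
count = write_reviews(
    [
        {
            "review_title": "Great shop",
            "review_date_original": "30 November 2023",
            "review_rating": "Rated 5 out of 5 stars",
            "review_text": None,
            "page_number": 1,
        }
    ],
    sample,
)

print(count_empty, count)
print(sample.getvalue())
```

With a real file, the same function can be called inside `with open('zzz_my_result.csv', 'w', newline='') as output_file:`, and the returned count makes it easy to warn when nothing was scraped.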

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
