Issue
I'm brand new to programming and to web scraping. I've managed to scrape a practice site, but the output is writing each entry twice. Can someone tell me why? Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Get base url for later in order to make complete links to each product
baseurl = 'http://books.toscrape.com/catalogue/'

# Input a standard user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Create empty list to store later data
productlinks = []

# Create outer loop to loop through each page of the ecommerce site using f-string, *confirmed*
for x in range(1, 51):
    # Request to get links from index page, create Soup
    r = requests.get(f'http://books.toscrape.com/catalogue/page-{x}.html')
    soup = BeautifulSoup(r.content, 'lxml')
    # Direct script to container for each product, *confirmed*
    productlist = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
    # Create inner loop to go through each product container and take out href, *confirmed*, concatenate with base url, *confirmed*
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(baseurl + link['href'])
# Test to go through each single product page to extract data, *confirmed*
#testlink = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
bookslist = []
for link in productlinks:
    # Create request for testlink, insert user agent header, create Soup; (later changed 'testlink' to 'link' to create loop for productlinks)
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # Extract name on its own
    name = soup.find('h1').text.strip()
    # Create loop to extract data from table, *confirmed*
    table = soup.find('table', class_='table table-striped')
    table_data = table.find_all('td')
    list = []
    for data in table_data:
        elements = data.text
        list.append(elements)
    # Create dictionary, *confirmed*
    books = {
        'name': name,
        'upc': list[0],
        'product': list[1],
        'price': list[2],
        'availability': list[5],
        'no. of reviews': list[6]
    }
    # Append each book's data to final bookslist
    bookslist.append(books)
    print('Saving: ', books['name'])
In short, I followed along with a video but used a website of my own choosing. In the video, the presenter did not have to loop through a table for the non-name data, so I think my issue is somewhere in that part, but I can't pinpoint the exact cause.
Solution
Your issue is with the following code snippet:
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(baseurl + link['href'])
Each item contains two links pointing to the same book page: one wrapped around the cover image, and one on the title. Appending both gives you every product link twice.
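You can see the two anchors directly by parsing a single product container. The snippet below uses a simplified, hypothetical stand-in for one li element, modeled on the markup books.toscrape.com actually serves:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one product <li> from the catalogue page
# (hypothetical markup, modeled on books.toscrape.com's structure).
html = '''
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  <article class="product_pod">
    <div class="image_container">
      <a href="a-light-in-the-attic_1000/index.html"><img src="cover.jpg"/></a>
    </div>
    <h3><a href="a-light-in-the-attic_1000/index.html">A Light in the Attic</a></h3>
  </article>
</li>
'''

item = BeautifulSoup(html, 'html.parser')
links = item.find_all('a', href=True)
print(len(links))                            # 2 -- the image link and the title link
print(links[0]['href'] == links[1]['href'])  # True -- both point at the same page
```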
Two possible ways to fix this:
- remove duplicates from productlinks with for link in list(set(productlinks))
or
- instead of selecting all links in each item with
for link in item.find_all('a', href=True):
    productlinks.append(baseurl + link['href'])
just select the first link in each item:
for item in productlist:
    unique_prod_link = item.find_all('a', href=True)[0]
    productlinks.append(baseurl + unique_prod_link['href'])
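One caveat with the set() approach: a set discards the original page order. If you want to drop the duplicates while keeping the books in the order they were scraped, dict.fromkeys works, since dict keys are unique and preserve insertion order in Python 3.7+. A minimal sketch with dummy URLs:

```python
# Order-preserving de-duplication: dict keys are unique and, since
# Python 3.7, preserve insertion order.
productlinks = [
    'http://books.toscrape.com/catalogue/a_1/index.html',
    'http://books.toscrape.com/catalogue/a_1/index.html',
    'http://books.toscrape.com/catalogue/b_2/index.html',
]
unique_links = list(dict.fromkeys(productlinks))
print(unique_links)
# ['http://books.toscrape.com/catalogue/a_1/index.html',
#  'http://books.toscrape.com/catalogue/b_2/index.html']
```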
Answered By - Barry the Platipus