Issue
I'm brand new to programming and to web scraping. I've managed to scrape a practice site, but the output is writing each entry twice. Can someone tell me why? Thanks!
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Get base url for later in order to make complete links to each product
baseurl = 'http://books.toscrape.com/catalogue/'

# Input a standard user agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Create empty list to store later data
productlinks = []

# Create outer loop to loop through each page of the ecommerce site using f-string, *confirmed*
for x in range(1, 51):
    # Request to get links from index page, create Soup
    r = requests.get(f'http://books.toscrape.com/catalogue/page-{x}.html')
    soup = BeautifulSoup(r.content, 'lxml')
    # Direct script to container for each product, *confirmed*
    productlist = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
    # Create inner loop to go through each product container and take out href, *confirmed*, concatenate with base url, *confirmed*
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(baseurl + link['href'])
# Test to go through each single product page to extract data, *confirmed*
#testlink = 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
bookslist = []
for link in productlinks:
    # Create request for testlink, insert user agent header, create Soup; (later changed 'testlink' to 'link' to create loop for productlinks)
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # Extract name on its own
    name = soup.find('h1').text.strip()
    # Create loop to extract data from table, *confirmed*
    table = soup.find('table', class_='table table-striped')
    table_data = table.find_all('td')
    list = []
    for data in table_data:
        elements = data.text
        list.append(elements)
    # Create dictionary, *confirmed*
    books = {
        'name': name,
        'upc': list[0],
        'product': list[1],
        'price': list[2],
        'availability': list[5],
        'no. of reviews': list[6]
    }
    # Append each book's data to final bookslist
    bookslist.append(books)
    print('Saving: ', books['name'])
In short, I followed along with a video but used a website of my own choosing. In the video, the presenter did not have to loop through a table for the non-name data, so I think my issue is somewhere in that part, but I can't pinpoint the exact cause.
Solution
Your issue is with the following code snippet:
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(baseurl + link['href'])
Each item contains two links pointing to the same book page: one wrapped around the cover image, and one on the title. Appending both gives you every product link twice.
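You can see the two anchors directly by parsing a single product container. The snippet below uses a simplified, hypothetical stand-in for one li element, modeled on the markup books.toscrape.com actually serves:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one product <li> from the catalogue page
# (hypothetical markup, modeled on books.toscrape.com's structure).
html = '''
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
  <article class="product_pod">
    <div class="image_container">
      <a href="a-light-in-the-attic_1000/index.html"><img src="cover.jpg"/></a>
    </div>
    <h3><a href="a-light-in-the-attic_1000/index.html">A Light in the Attic</a></h3>
  </article>
</li>
'''

item = BeautifulSoup(html, 'html.parser')
links = item.find_all('a', href=True)
print(len(links))                            # 2 -- the image link and the title link
print(links[0]['href'] == links[1]['href'])  # True -- both point at the same page
```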
Two possible ways to fix this:
- remove duplicates from productlinks with for link in list(set(productlinks))
or
- instead of selecting all links in each item with
for link in item.find_all('a', href=True):
    productlinks.append(baseurl + link['href'])
just select the first link in each item:
for item in productlist:
    unique_prod_link = item.find_all('a', href=True)[0]
    productlinks.append(baseurl + unique_prod_link['href'])
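One caveat with the set() approach: a set discards the original page order. If you want to drop the duplicates while keeping the books in the order they were scraped, dict.fromkeys works, since dict keys are unique and preserve insertion order in Python 3.7+. A minimal sketch with dummy URLs:

```python
# Order-preserving de-duplication: dict keys are unique and, since
# Python 3.7, preserve insertion order.
productlinks = [
    'http://books.toscrape.com/catalogue/a_1/index.html',
    'http://books.toscrape.com/catalogue/a_1/index.html',
    'http://books.toscrape.com/catalogue/b_2/index.html',
]
unique_links = list(dict.fromkeys(productlinks))
print(unique_links)
# ['http://books.toscrape.com/catalogue/a_1/index.html',
#  'http://books.toscrape.com/catalogue/b_2/index.html']
```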
Answered By - Barry the Platipus