Issue
I am scraping product descriptions from a website. Each product has an "Old Price" and a "New Price", except one, which has only the "New Price". I append the values to four initially empty lists: "Product Names", "Product Old Price", "Product New Price" and "Product Reviews". When I try to build a CSV file, I get the error "arrays must all be the same length". The reason is that the "Product Old Price" list has 17 entries while the other three lists have 18, because one product has no "Product Old Price". Below is my code:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.petplanet.co.uk/d7/dog_food"
r = requests.get(url)
soup = BeautifulSoup(r.content)

prod_name =[]
prod_old_price = []
prod_new_price = []
prod_reviews = []

# Product names
item = soup.findAll("a", class_ = "thumbLink")
for name in item[0:15]:
    pro_name = name.get("title")
    prod_name.append(pro_name)

# New prices
price = soup.findAll("span", class_ = "price right")
for prices in price:
    pro_new_price1 = prices.text
    pro_new_price = pro_new_price1.replace("Â"," ")
    prod_new_price.append(pro_new_price)

# Old prices (one product has none, so this list ends up shorter)
old_price = soup.findAll("span", class_ = "price-old")
for old_pri in old_price:
    pro_old_price = old_pri.text
    prod_old_price.append(pro_old_price)

# Review scores
reviews = soup.findAll("span", class_ = "text-prod-review-score")
for rev in reviews:
    pro_reviews = (len(rev))
    prod_reviews.append(pro_reviews)

pet_products = pd.DataFrame({"Product Name": prod_name, "Product Old Price": prod_old_price, "Product New Price": prod_new_price, "Product Reviews as # of Star": prod_reviews})
pet_products.to_csv("Pets Products.csv")
I want "N/A" or "None" wherever no "Product Old Price" is given. Or is there another way to do this? Thanks.
Solution
Recommendation
Loop over the products differently and build a list of dicts, which I think is easier to handle. Also use find_all() instead of the old findAll() spelling.
What happens?
Because old_price is not in the page_source when there is no sale_price, you won't find the right position to set an NA value with the way you are searching.
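A small sketch of why the page-wide search loses alignment. The markup below is a made-up stand-in for the real page (only the class names are taken from the question), so treat it as an illustration rather than the site's actual structure:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three products, the second one has no old price.
html = """
<ul>
  <li><span class="price right">£69.99</span><span class="price-old">£76.99</span></li>
  <li><span class="price right">£2.19</span></li>
  <li><span class="price right">£6.99</span><span class="price-old">£11.49</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Page-wide search: the two lists differ in length, and nothing records
# which product the missing old price belongs to.
new_prices = [s.get_text() for s in soup.find_all("span", class_="price right")]
old_prices = [s.get_text() for s in soup.find_all("span", class_="price-old")]
print(len(new_prices), len(old_prices))  # 3 2

# Per-product search: the gap shows up exactly where it occurs.
has_old_price = [
    li.find("span", class_="price-old") is not None
    for li in soup.find_all("li")
]
print(has_old_price)  # [True, False, True]
```

Searching inside each product keeps the position of the missing value, which is what makes it possible to fill in a placeholder.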
Take a look at my example: if there is no old_price, find() returns None and calling get_text() on it raises an AttributeError, which you can use to create the NA values:
try:
    old_price = product.find("span", class_ = "price-old").get_text(strip=True)
except AttributeError:
    old_price = 'NA'
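If you would rather not rely on an exception, the same fallback can be written as an explicit None check. This is only a sketch: text_or_default is a hypothetical helper name, and the one-product HTML is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, css_class, default="NA"):
    """Hypothetical helper: stripped text of the first match, or a default."""
    node = parent.find(name, class_=css_class)
    # find() returns None when nothing matches, so test before get_text()
    return node.get_text(strip=True) if node is not None else default


# Invented markup for a product that has only a new price
html = '<li><span class="price right">£2.19</span></li>'
product = BeautifulSoup(html, "html.parser").li

old_price = text_or_default(product, "span", "price-old")
new_price = text_or_default(product, "span", "price right")
print(old_price)  # NA
print(new_price)  # £2.19
```

Both variants behave the same for a missing tag; the explicit check just avoids hiding unrelated errors inside the try block.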
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.petplanet.co.uk/d7/dog_food"
r = requests.get(url)
soup = BeautifulSoup(r.content)

p_data = []
# Loop product by product so both prices come from the same list item
for product in soup.select('div#box-scroll-content li'):
    new_price = product.find("span", class_ = "price right").get_text().replace("Â"," ")
    try:
        old_price = product.find("span", class_ = "price-old").get_text(strip=True)
    except AttributeError:
        # find() returned None: this product has no old price
        old_price = 'NA'
    p_data.append({
        'new_price': new_price,
        'old_price': old_price
    })

pd.DataFrame(p_data)
Output
  new_price old_price
0    £69.99    £76.99
1     £2.19        NA
2     £6.99    £11.49
3     £6.99    £10.99
4     £0.89     £1.00
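Since the question asks for a CSV file with "N/A" in the gaps, one possible variant is to store None for a missing price and let pandas write the placeholder through the na_rep argument of to_csv(). The rows below are sample values in the shape of p_data, not freshly scraped data, and the sketch writes to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

# Sample rows shaped like p_data; None marks a missing old price
p_data = [
    {"new_price": "£69.99", "old_price": "£76.99"},
    {"new_price": "£2.19", "old_price": None},
]
df = pd.DataFrame(p_data)

# na_rep controls how missing values are written to the CSV
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep="N/A")
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Passing a file name such as "Pets Products.csv" instead of the buffer would write the same content to disk.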
Answered By - HedgeHog