Issue
I'm trying to scrape the ingredient list from products on https://www.mecca.com/en-au/skincare/
The relevant HTML element for the first product, for example (https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare), is:
<div role="region" title="Ingredients" id="sect5e670152-13f8-46ca-9f0b-4c60b533d079" aria-labelledby="accordion-5e670152-13f8-46ca-9f0b-4c60b533d079" aria-controls="accordion-5e670152-13f8-46ca-9f0b-4c60b533d079-container" class="css-1g2bsc4">
<div>
<p class="css-pdiqc3 e151354a4">Water/aqua/eau, glycerin, caprylic/capric triglyceride, isopropyl isostearate, pseudozyma epicola/camellia sinensis seed oil/glucose/glycine soja (soybean) meal/malt extract/yeast extract ferment filtrate, glyceryl stearate se, cetearyl alcohol, palmitic acid, stearic acid, pentylene glycol, plantago lanceolata leaf extract, adansonia digitata seed oil, citrullus lanatus (watermelon) seed oil, passiflora edulis seed oil, schinziophyton rautanenii kernel oil, sclerocarya birrea seed oil, polyglyceryl6 ximenia americana seedate, cholesterol, ceramide ap, ceramide eop, sodium hyaluronate crosspolymer, ceramide np, phytosphingosine, ceteareth20, trisodium ethylenediamine disuccinate, tocopherol, sodium lauroyl lactylate, sodium hydroxide, citric acid, carbomer, xanthan gum, caprylyl glycol, chlorphenesin, phenoxyethanol, ethylhexylglycerin.</p></div></div>
I would like to get the Ingredient list, i.e. the Water/aqua/eau...
Below is my code. Unfortunately, several tags on the page share the same <p class="css-pdiqc3 e151354a4"> markup:
import requests
from bs4 import BeautifulSoup

url = "https://www.mecca.com"
productlinks = []
for x in range(1, 5):
    soup = BeautifulSoup(requests.get(f'https://www.mecca.com/en-au/skincare/?page={x}').content, "html.parser")
    products = soup.find_all('div', class_="css-1r7iqog")
    for item in products:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])

for link in productlinks:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, "html.parser")
    brand = soup.find("span", class_="product-brand css-738lkl e11u5ot719").string
    name = soup.find('span', class_='product-name css-x4jxi0 e11u5ot718').string
    Ingred = soup.find('p', class_='css-pdiqc3 e151354a4')
    print(brand)
    print(name)
    print(Ingred)
I was reading through the other posts and have tried some of the other solutions suggested such as:
Ingred = soup.find_all('p', class_='css-pdiqc3 e151354a4')[8].get_text()
or
Details = soup.find_all('Ingredients', {'class':'css-1g2bsc4'})
Ingred = BeautifulSoup(str(Details).strip()).get_text()
but I can't seem to get it to work!
Solution
If I understand correctly, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
title = soup.h1.text
print(title)
print("-" * 80)
all_p = soup.select('h3:-soup-contains("ingredients") ~ p')
for p in all_p:
    ingr, desc = p.text.split(":")
    print(f"{ingr:<40} {desc:<60}")
Prints:
Drunk Elephant Lala Retro™ Whipped Cream
--------------------------------------------------------------------------------
Plantain extract                          promotes skin firmness and elasticity whilst evening skin tone.
Fermented green tea seed                  fights ageing, inflammation and protects against environmental aggressors.
Sodium hyaluronate crosspolymer           smooths fine lines and wrinkles whilst stimulating collagen production.
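One caveat with the snippet above: unpacking `p.text.split(":")` into two names raises a ValueError whenever a paragraph contains no colon or more than one. A minimal sketch of a more forgiving split, using `str.partition`, might look like this. The helper name `split_ingredient_line` and the sample strings are made up for illustration and do not come from the page.

```python
def split_ingredient_line(text):
    """Split 'Name: description' on the first colon only.

    Returns (name, description); description is "" when there is no colon.
    """
    name, sep, desc = text.partition(":")
    if not sep:
        # No colon at all: treat the whole line as the name.
        return text.strip(), ""
    return name.strip(), desc.strip()


# Hypothetical sample lines, not scraped from the site.
samples = [
    "Plantain extract: promotes skin firmness",
    "Green tea: rich in EGCG: an antioxidant",
    "No colon here",
]
for line in samples:
    ingr, desc = split_ingredient_line(line)
    print(f"{ingr:<40} {desc:<60}")
```

In the loop above, `ingr, desc = split_ingredient_line(p.text)` would replace the `split(":")` line.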
EDIT: The ingredients section is embedded inside a <script> element, so to parse it, use the json module:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
title = soup.h1.text
print(title)
print("-" * 80)
data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)
# print(json.dumps(data, indent=4))
ingredients = data["props"]["pageProps"]["pdpContent"]["ingredients"][0]
print(ingredients)
Prints:
Drunk Elephant Lala Retro™ Whipped Cream
--------------------------------------------------------------------------------
Water/aqua/eau, glycerin, caprylic/capric triglyceride, isopropyl isostearate, pseudozyma epicola/camellia sinensis seed oil/glucose/glycine soja (soybean) meal/malt extract/yeast extract ferment filtrate, glyceryl stearate se, cetearyl alcohol, palmitic acid, stearic acid, pentylene glycol, plantago lanceolata leaf extract, adansonia digitata seed oil, citrullus lanatus (watermelon) seed oil, passiflora edulis seed oil, schinziophyton rautanenii kernel oil, sclerocarya birrea seed oil, polyglyceryl6 ximenia americana seedate, cholesterol, ceramide ap, ceramide eop, sodium hyaluronate crosspolymer, ceramide np, phytosphingosine, ceteareth20, trisodium ethylenediamine disuccinate, tocopherol, sodium lauroyl lactylate, sodium hydroxide, citric acid, carbomer, xanthan gum, caprylyl glycol, chlorphenesin, phenoxyethanol, ethylhexylglycerin.
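When this lookup is run over many product pages, as in the loop from the question, some pages may not have the same keys, and the chained indexing would then raise a KeyError or IndexError. The sketch below walks the same path defensively. The helper `extract_ingredients` is a hypothetical name, and the sample payload is a trimmed-down stand-in that only assumes the key layout shown in the answer.

```python
import json


def extract_ingredients(next_data):
    """Return the first ingredients string, or None if the expected keys are missing."""
    node = next_data
    # Same path as data["props"]["pageProps"]["pdpContent"]["ingredients"].
    for key in ("props", "pageProps", "pdpContent", "ingredients"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # The answer indexes [0], so a non-empty list is expected here.
    if isinstance(node, list) and node:
        return node[0]
    return None


# Hypothetical payload shaped like the parsed __NEXT_DATA__ JSON.
sample = json.loads(
    '{"props": {"pageProps": {"pdpContent": {"ingredients": ["Water/aqua/eau, glycerin"]}}}}'
)
print(extract_ingredients(sample))
print(extract_ingredients({"props": {}}))
```

Inside a per-product loop, the parsed JSON from `soup.select_one("#__NEXT_DATA__")` would be passed to the helper. Note that `select_one` itself returns None when the script tag is absent, so that case would need its own check.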
Answered By - Andrej Kesely