Tuesday, January 30, 2024

[FIXED] Webscrape using BS4 but text has same p class

Tags: beautifulsoup, python, web-scraping

Issue

I'm trying to scrape the ingredient list from products on https://www.mecca.com/en-au/skincare/

For example, the HTML element holding the ingredients on the first product's page (https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare) is:

<div role="region" title="Ingredients" id="sect5e670152-13f8-46ca-9f0b-4c60b533d079" aria-labelledby="accordion-5e670152-13f8-46ca-9f0b-4c60b533d079" aria-controls="accordion-5e670152-13f8-46ca-9f0b-4c60b533d079-container" class="css-1g2bsc4">
  <div>
    <p class="css-pdiqc3 e151354a4">Water/aqua/eau, glycerin, caprylic/capric triglyceride, isopropyl isostearate, pseudozyma epicola/camellia sinensis seed oil/glucose/glycine soja (soybean) meal/malt extract/yeast extract ferment filtrate, glyceryl stearate se, cetearyl alcohol, palmitic acid, stearic acid, pentylene glycol, plantago lanceolata leaf extract, adansonia digitata seed oil, citrullus lanatus (watermelon) seed oil, passiflora edulis seed oil, schinziophyton rautanenii kernel oil, sclerocarya birrea seed oil, polyglyceryl6 ximenia americana seedate, cholesterol, ceramide ap, ceramide eop, sodium hyaluronate crosspolymer, ceramide np, phytosphingosine, ceteareth20, trisodium ethylenediamine disuccinate, tocopherol, sodium lauroyl lactylate, sodium hydroxide, citric acid, carbomer, xanthan gum, caprylyl glycol, chlorphenesin, phenoxyethanol, ethylhexylglycerin.</p></div></div>

I would like to get the ingredient list, i.e. the text beginning "Water/aqua/eau...".

Below is my code. Unfortunately, there are several tags on the page with the same <p class="css-pdiqc3 e151354a4">:

import requests
from bs4 import BeautifulSoup

url = "https://www.mecca.com"

productlinks = []

for x in range(1,5):
    soup = BeautifulSoup(requests.get(f'https://www.mecca.com/en-au/skincare/?page={x}').content, "html.parser")
    products = soup.find_all('div', class_="css-1r7iqog")

    for item in products:
        for link in item.find_all('a', href=True):
            productlinks.append(url + link['href'])
    
for link in productlinks:
    response = requests.get(link)

    soup = BeautifulSoup(response.content, "html.parser")
    brand = soup.find("span", class_="product-brand css-738lkl e11u5ot719").string
    name = soup.find('span', class_='product-name css-x4jxi0 e11u5ot718').string
    Ingred = soup.find('p', class_='css-pdiqc3 e151354a4')
    


    print(brand)
    print(name)
    print(Ingred)

I was reading through the other posts and have tried some of the other solutions suggested such as:

Ingred = soup.find_all('p', class_='css-pdiqc3 e151354a4')[8].get_text()

or 

Details = soup.find_all('Ingredients', {'class':'css-1g2bsc4'})
Ingred = BeautifulSoup(str(Details).strip()).get_text()

but I can't seem to get it to work!

Solution

If I understand correctly, you can try:

import requests
from bs4 import BeautifulSoup

url = "https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

title = soup.h1.text
print(title)
print("-" * 80)

all_p = soup.select('h3:-soup-contains("ingredients") ~ p')
for p in all_p:
    # Split on the first colon only, so a colon inside the description does not break unpacking
    ingr, desc = p.text.split(":", 1)
    print(f"{ingr:<40} {desc:<60}")

Prints:

Drunk Elephant Lala Retro™ Whipped Cream
--------------------------------------------------------------------------------
Plantain extract                          promotes skin firmness and elasticity whilst evening skin tone.
Fermented green tea seed                  fights ageing, inflammation and protects against environmental aggressors.
Sodium hyaluronate crosspolymer           smooths fine lines and wrinkles whilst stimulating collagen production.
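
The selector above keys on nearby text rather than on the shared class. The same idea can be applied to the markup shown in the question by anchoring on the parent element's title attribute. This is a minimal sketch run against a simplified, hypothetical copy of that markup (SAMPLE_HTML is invented for illustration, and the "Directions" region is an assumption); it only helps if this markup is actually present in the HTML that requests receives:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the snippet in the question.
# Two regions share the same <p> class, so the class alone cannot tell them apart.
SAMPLE_HTML = """
<div role="region" title="Directions" class="css-1g2bsc4">
  <div><p class="css-pdiqc3 e151354a4">Apply morning and night.</p></div>
</div>
<div role="region" title="Ingredients" class="css-1g2bsc4">
  <div><p class="css-pdiqc3 e151354a4">Water/aqua/eau, glycerin.</p></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Scope the search to the region whose title attribute is "Ingredients",
# then take the paragraph nested inside it.
ingredient_p = soup.select_one('div[title="Ingredients"] p')
ingredients = ingredient_p.get_text(strip=True) if ingredient_p else None
print(ingredients)
```

Selecting by the title attribute avoids relying on a positional index such as [8], which can shift from one product page to the next.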

EDIT: The ingredients section is embedded as JSON inside a <script> element, so to parse it, use the json module:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.mecca.com/en-au/drunk-elephant/lala-retro-whipped-cream-V-038949/?cgpath=skincare"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

title = soup.h1.text
print(title)
print("-" * 80)

data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)

# print(json.dumps(data, indent=4))

ingredients = data["props"]["pageProps"]["pdpContent"]["ingredients"][0]
print(ingredients)

Prints:

Drunk Elephant Lala Retro™ Whipped Cream
--------------------------------------------------------------------------------
Water/aqua/eau, glycerin, caprylic/capric triglyceride, isopropyl isostearate, pseudozyma epicola/camellia sinensis seed oil/glucose/glycine soja (soybean) meal/malt extract/yeast extract ferment filtrate, glyceryl stearate se, cetearyl alcohol, palmitic acid, stearic acid, pentylene glycol, plantago lanceolata leaf extract, adansonia digitata seed oil, citrullus lanatus (watermelon) seed oil, passiflora edulis seed oil, schinziophyton rautanenii kernel oil, sclerocarya birrea seed oil, polyglyceryl6 ximenia americana seedate, cholesterol, ceramide ap, ceramide eop, sodium hyaluronate crosspolymer, ceramide np, phytosphingosine, ceteareth20, trisodium ethylenediamine disuccinate, tocopherol, sodium lauroyl lactylate, sodium hydroxide, citric acid, carbomer, xanthan gum, caprylyl glycol, chlorphenesin, phenoxyethanol, ethylhexylglycerin.
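
When looping over many product pages, as in the question, some pages may lack the script tag or one of the nested keys, and the direct indexing above would then raise an exception. Here is a rough sketch of a more defensive lookup. The helper name extract_ingredients and the inline SAMPLE_HTML are hypothetical, and the key path is simply the one used above, assumed to be the same on other product pages:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for a product page; the real JSON is far larger.
SAMPLE_HTML = """
<html><body>
<h1>Example Product</h1>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"pdpContent": {"ingredients": ["Water/aqua/eau, glycerin."]}}}}
</script>
</body></html>
"""


def extract_ingredients(html):
    """Return the first ingredients string, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one("#__NEXT_DATA__")
    if script is None or not script.string:
        return None
    try:
        data = json.loads(script.string)
    except json.JSONDecodeError:
        return None
    # Walk the key path one level at a time instead of indexing directly,
    # so a missing key yields None rather than a KeyError.
    node = data
    for key in ("props", "pageProps", "pdpContent", "ingredients"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    if isinstance(node, list) and node:
        return node[0]
    return None


print(extract_ingredients(SAMPLE_HTML))
print(extract_ingredients("<html></html>"))
```

In the question's loop, this could be called as extract_ingredients(response.content) for each product link, skipping products for which it returns None.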

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
