Thursday, December 8, 2022

[FIXED] Scrape data inside <script type="text/javascript"> using BeautifulSoup

Labels: beautifulsoup, python, web-scraping

Issue

I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "login for Price" banner, but the price is still present in the HTML inside a <script type="text/javascript"> tag, and I'm unable to pull it out. The specific link that I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/

My current code is below; the last line is the one I'm using to pull the text out.

```
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = "https://www.chadwellsupply.com/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# Collect product links from the first two pages of the category listing.
productlinks = []
for x in range(1, 3):
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=')
    soup = BeautifulSoup(response.content, 'html.parser')

    productlist = soup.find_all('div', class_="product-header")

    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])

# Single product page used for testing.
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'

response = requests.get(testlink, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')

print(soup.find('div', class_="product-title").text.strip())
print(soup.find('p', class_="status").text.strip())
print(soup.find('meta', {'property': 'og:url'}))
print(soup.find('div', class_="tab-pane fade show active").text.strip())
print(soup.find('div', class_="Chadwell-Shared-Breadcrumbs").text.strip())
# Attempt to pull the text of the script that holds the price.
print(soup.find('script', {'type': 'text/javascript'}).text.strip())
```

Below is the chunk of script from the website (I tried to paste it directly here but it wouldn't format correctly) that I'm expecting it to pull, but what it gives me is "window.dataLayer = window.dataLayer || [];"

HTML From website

Ideally I'd like to pull out just the price, but if I can at least get the whole chunk of data out I can extract the price manually.

Solution

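You can use the re and json modules to search and parse the HTML data. BeautifulSoup parses HTML, not JavaScript, so it cannot look inside the script for you, and soup.find('script', ...) only returns the first matching <script> tag, which is why you get the window.dataLayer line. Another option is to use selenium.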

```
import re
import json
import requests

url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"

html_doc = requests.get(url).text

data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)

print(data)
```

Prints:

```
{
    "id": "301078",
    "name": 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE',
    "category": "Stove/ Ranges",
    "brand": "Hotpoint",
    "price": "759",
}
```
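The search-and-parse step can also be tried offline. The following is only a sketch: `SAMPLE_HTML` is a hypothetical stand-in for the product page (the real markup may differ), and `extract_product` is an illustrative helper name, not part of any library. It returns `None` instead of raising when the `ga('ec:addProduct', ...)` call is missing from the page.

```python
import json
import re

# Hypothetical stand-in for the product page; the real markup may differ.
SAMPLE_HTML = """
<html><head>
<script type="text/javascript">window.dataLayer = window.dataLayer || [];</script>
<script type="text/javascript">
ga('create', 'UA-XXXXX-Y', 'auto');
ga('ec:addProduct', {"id": "301078", "name": "HOTPOINT 24\\" SPACESAVER ELECTRIC RANGE - WHITE", "category": "Stove/ Ranges", "brand": "Hotpoint", "price": "759"});
ga('send', 'pageview');
</script>
</head><body></body></html>
"""

# Same pattern as in the answer: capture the argument passed to ga('ec:addProduct', ...).
PRODUCT_RE = re.compile(r"ga\('ec:addProduct', (.*?)\);")


def extract_product(html_doc):
    """Return the ec:addProduct payload as a dict, or None if it is absent."""
    match = PRODUCT_RE.search(html_doc)
    if match is None:
        return None
    return json.loads(match.group(1))


product = extract_product(SAMPLE_HTML)
print(product["price"])
print(extract_product("<html></html>"))
```

The non-greedy `(.*?)` stops at the first `);`, so this assumes the payload sits on one line and contains no `);` of its own.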

Then for the price you can do:

```
print(data["price"])
```

Prints:

```
759
```
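A possible variation that stays with BeautifulSoup for locating the tag: `find()` stops at the first `<script>`, while `find_all()` returns every one, so the scripts can be filtered on a marker string before the regex is applied. This is a sketch on hypothetical markup (`SAMPLE_HTML` and `MARKER` are illustrative names, and the real page may be laid out differently).

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the first script is the dataLayer line from the question.
SAMPLE_HTML = """
<script type="text/javascript">window.dataLayer = window.dataLayer || [];</script>
<script type="text/javascript">
ga('ec:addProduct', {"id": "301078", "brand": "Hotpoint", "price": "759"});
</script>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag, which here is the dataLayer script.
first_script = soup.find("script", {"type": "text/javascript"})
print(first_script.string.strip())

# find_all() returns every matching tag; keep the one that mentions the marker.
MARKER = "ec:addProduct"
scripts = soup.find_all("script", {"type": "text/javascript"})
target = next((s for s in scripts if MARKER in (s.string or "")), None)

price = None
if target is not None:
    match = re.search(r"ga\('ec:addProduct', (.*?)\);", target.string)
    if match:
        price = json.loads(match.group(1))["price"]

print(price)
```

BeautifulSoup still treats the script body as plain text, so the regex and `json.loads` are what actually read the price.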

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
