Issue
Im building a web scrape to pull product data from a website, this particular company hides the price behind a "login for Price" banner but the price is hidden in the HTML under <Script type="text/javascript" but im unable to pull it out. the specific link that im testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/
My current code is this and the last line is the one im using to pull the text out.
```
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl="https://www.chadwellsupply.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
productlinks = []
for x in range (1,3):
response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list¤ttab=products&pagenumber={x}&articlepage=')
soup = BeautifulSoup(response.content,'html.parser')
productlist = soup.find_all('div', class_="product-header")
for item in productlist:
for link in item.find_all('a', href = True):
productlinks.append(link['href'])
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
print(soup.find('div',class_="product-title").text.strip())
print(soup.find('p',class_="status").text.strip())
print(soup.find('meta',{'property':'og:url'}))
print(soup.find('div',class_="tab-pane fade show active").text.strip())
print(soup.find('div',class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script',{'type':'text/javascript'}).text.strip())
```
Below is the chunk of script from the website (tried to paste directly here but it wouldnt format correctly) that im expecting it to pull but what it gives me is "window.dataLayer = window.dataLayer || [];"
Ideally id like to just pull the price out but if i can atleast get the whole chunk of data out i can manually extract price.
Solution
You can use re
/json
module to search/parse the HTML data (obviously, beautifulsoup
cannot parse JavaScript - another option is to use selenium
).
import re
import json
import requests
url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"
html_doc = requests.get(url).text
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)
print(data)
Prints:
{
"id": "301078",
"name": 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE',
"category": "Stove/ Ranges",
"brand": "Hotpoint",
"price": "759",
}
Then for price you can do:
print(data["price"])
Prints:
759
Answered By - Andrej Kesely
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.