Issue
I am trying to scrape the "Amenities and More" section of the Yelp page for a few restaurants. The problem is that the restaurant's Yelp page only displays the first few amenities; there is an "n More Attributes" button that, when clicked, reveals the rest. Using BeautifulSoup and Selenium with the page URL, and using BeautifulSoup with requests, give exactly the same results, and I am stuck on how to expand the full amenities list before grabbing it in my code. The two pictures below show the page before and after the button is clicked.
- "Before clicking '5 More Attributes': The first pic shows 4 "div" within which lies "span" that I can get to using any of the above methods.
- "After clicking '5 More Attributes': The second pic shows 9 "div" within which lies "span" that I am trying to get to.
Here is the code using Selenium and BeautifulSoup:
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup

URL = 'https://www.yelp.com/biz/ziggis-coffee-longmont'

driver = webdriver.Chrome(r"C:\Users\Fariha\AppData\Local\Programs\chromedriver_win32\chromedriver.exe")
driver.get(URL)

# Grab the rendered page source and parse it
yelp_page_source_page1 = driver.page_source
soup = BeautifulSoup(yelp_page_source_page1, 'html.parser')
spans = soup.find_all('span')
Result: there are 990 elements in "spans"; I am only showing what is relevant to my question.
Solution
An alternative approach is to extract the data directly from the site's JSON API. This can be done without the overhead of Selenium, as follows:
from bs4 import BeautifulSoup
import requests
import json

session = requests.Session()
r = session.get('https://www.yelp.com/biz/ziggis-coffee-longmont')
#r = session.get('https://www.yelp.com/biz/menchies-frozen-yogurt-lafayette')
soup = BeautifulSoup(r.content, 'lxml')

# Locate the business ID to use (from JSON inside one of the script entries)
for script in soup.find_all('script', attrs={"type": "application/json"}):
    gaConfig = json.loads(script.text.strip('<!-->'))

    try:
        biz_id = gaConfig['gaConfig']['dimensions']['www']['business_id'][1]
        break
    except KeyError:
        pass

# Build a suitable JSON request for the required information
json_post = [
    {
        "operationName": "GetBusinessAttributes",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "35e0950cee1029aa00eef5180adb55af33a0217c64f379d778083eb4d1c805e7"
        }
    },
    {
        "operationName": "GetBizPageProperties",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"
        }
    },
]

# Send the batched request; the first entry holds the business details,
# the second holds the amenity properties
r = session.post('https://www.yelp.com/gql/batch', json=json_post)
j = r.json()

business = j[0]['data']['business']
print(business['name'], '\n')

for property in j[1]['data']['business']['organizedProperties'][0]['properties']:
    print(f'{"Yes" if property["isActive"] else "No":5} {property["displayText"]}')
This would give you the following entries:
Ziggi's Coffee
Yes Offers Delivery
Yes Offers Takeout
Yes Accepts Credit Cards
Yes Private Lot Parking
Yes Bike Parking
Yes Drive-Thru
No No Outdoor Seating
No No Wi-Fi
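Since the question mentions doing this for a few restaurants, the same approach can be wrapped in a small helper and run over several URLs. This is only a minimal sketch, not part of the original answer: the function name get_amenities is illustrative, and the second URL is the one from the commented-out line above.

from bs4 import BeautifulSoup
import requests
import json

def get_amenities(session, url):
    # Fetch the business page and pull the encoded business ID out of the
    # embedded JSON config, exactly as in the answer above.
    r = session.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    biz_id = None
    for script in soup.find_all('script', attrs={"type": "application/json"}):
        data = json.loads(script.text.strip('<!-->'))
        try:
            biz_id = data['gaConfig']['dimensions']['www']['business_id'][1]
            break
        except KeyError:
            pass
    if biz_id is None:
        return None, []

    # Same batched GraphQL request as above (documentId values copied verbatim).
    json_post = [
        {
            "operationName": "GetBusinessAttributes",
            "variables": {"BizEncId": biz_id},
            "extensions": {"documentId": "35e0950cee1029aa00eef5180adb55af33a0217c64f379d778083eb4d1c805e7"},
        },
        {
            "operationName": "GetBizPageProperties",
            "variables": {"BizEncId": biz_id},
            "extensions": {"documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"},
        },
    ]
    j = session.post('https://www.yelp.com/gql/batch', json=json_post).json()
    name = j[0]['data']['business']['name']
    properties = j[1]['data']['business']['organizedProperties'][0]['properties']
    return name, properties

session = requests.Session()
for url in ['https://www.yelp.com/biz/ziggis-coffee-longmont',
            'https://www.yelp.com/biz/menchies-frozen-yogurt-lafayette']:
    name, properties = get_amenities(session, url)
    if name is None:
        continue  # business ID not found on the page
    print(name)
    for prop in properties:
        print(f'  {"Yes" if prop["isActive"] else "No":5} {prop["displayText"]}')
    print()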
How was this solved?
Your best friend here is your browser's network dev tools. With these you can watch the requests the page makes to obtain its information. The normal flow is that the initial HTML page is downloaded, its javascript runs, and further requests are made to fill in the rest of the page.
The trick is to first locate where the data you want comes from (it is often returned as JSON), then work out which parameters you need to recreate the request for it.
To further understand this code, use print(). Print everything; it will show you how each part builds on the next. That is how the script was written, one bit at a time.
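As a concrete example, continuing from the script above where j holds the decoded batch response, one way to "print everything" is to pretty-print the JSON and read off the keys you need (the slice is just to keep the output short):

import json

# Pretty-print the decoded batch response to see its overall shape;
# only the first part is shown to keep the output readable.
print(json.dumps(j, indent=2)[:2000])

# Drill into one branch once the structure is clear, e.g. the amenity
# block used in the loop above.
print(json.dumps(j[1]['data']['business']['organizedProperties'][0], indent=2))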
Approaches using Selenium let the javascript run, but most of the time this is not needed: the javascript is just making requests and formatting the data for display.
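For comparison, if you did want to stay with Selenium, a rough sketch of that route would be to wait for the "More Attributes" button, click it, and only then pass driver.page_source to BeautifulSoup. The explicit wait and the XPath below are assumptions about the page, not something taken from the answer, and would need checking against the live markup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on PATH or managed by Selenium
driver.get('https://www.yelp.com/biz/ziggis-coffee-longmont')

# Wait for the "N More Attributes" button and click it so the remaining
# amenity <span> elements are added to the DOM. The XPath is an assumption
# based on the visible button text.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(., 'More Attributes')]"))
)
button.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
spans = soup.find_all('span')
driver.quit()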
Answered By - Martin Evans