Issue
I am trying to scrape the email addresses of companies listed on Google Cloud's website, but the HTML I get back doesn't include the tags I see on the page.
This is my code:
from bs4 import BeautifulSoup
import requests

url = 'https://cloud.google.com/find-a-partner/partner/quantiphi-inc'
result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")
print(doc.prettify())
And this is my result:
<body><app-root></app-root><script nonce="_03s1Ejyv8NwTRGiuJXfgw" src="//www.gstatic.com/alkali/929b6f9d67240a61d7eef83db9b5065b01601907.js"></script></body>
but if you inspect the actual page in the browser, the HTML is different. Is it because the content is being fetched by the JavaScript file? Is there a way to bypass this?
Thanks!
I was hoping to scrape the entire HTML code of the page.
Solution
The problem is that the Google Cloud page is rendered dynamically in the browser by JavaScript (note the empty app-root placeholder in your output), so the raw HTML returned by requests contains almost none of the content you see.
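You can confirm this from the response in the question itself: the body contains only an empty app-root element and a script tag, so there is no text for BeautifulSoup to find. A quick sketch:

```python
from bs4 import BeautifulSoup

# The HTML that requests.get actually returned (copied from the question above)
raw = ('<body><app-root></app-root>'
       '<script src="//www.gstatic.com/alkali/'
       '929b6f9d67240a61d7eef83db9b5065b01601907.js"></script></body>')

soup = BeautifulSoup(raw, "html.parser")
print(soup.find("app-root").contents)     # the placeholder has no children
print(repr(soup.get_text(strip=True)))    # and there is no visible text at all
```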
I would suggest using Selenium or Playwright to load the page in a real browser and then grab the HTML source, which will be fully rendered and match what you see in the browser's inspection tool.
Here is the complete code that would do what you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://cloud.google.com/find-a-partner/partner/quantiphi-inc"

driver = webdriver.Chrome()
driver.get(url)

# 1. get the rendered HTML
selector = '*[aria-label="Partner email address"]'

# 1.a. wait until the element is rendered
wait = WebDriverWait(driver, 10)
wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
)

# 1.b. get the HTML
html = driver.page_source

# 2. parse it with BeautifulSoup
with open("output.html", "w") as f:
    f.write(html)

soup = BeautifulSoup(html, "html.parser")
email_elements = soup.select(selector)
for email_element in email_elements:
    print(email_element.text.strip())

driver.quit()
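The parsing step on its own can be checked without a browser. Given a fragment shaped like the rendered page (the mailto link and email value here are made up for illustration; only the aria-label matches the real selector), soup.select pulls the text out:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the rendered partner page
rendered = """
<div>
  <a aria-label="Partner email address" href="mailto:partners@example.com">
    partners@example.com
  </a>
</div>
"""

soup = BeautifulSoup(rendered, "html.parser")
emails = [el.text.strip()
          for el in soup.select('*[aria-label="Partner email address"]')]
print(emails)  # ['partners@example.com']
```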
Answered By - Yubo