Issue
I am trying to scrape the email addresses of companies listed on Google Cloud's website, but the HTML I get back doesn't include the tags I see on the page.
This is my code:
from bs4 import BeautifulSoup
import requests

url = 'https://cloud.google.com/find-a-partner/partner/quantiphi-inc'
result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")
print(doc.prettify())
And this is my result:
<body><app-root></app-root><script nonce="_03s1Ejyv8NwTRGiuJXfgw" src="//www.gstatic.com/alkali/929b6f9d67240a61d7eef83db9b5065b01601907.js"></script></body>
but if you inspect the actual page in the browser, the HTML is different. Is it because the content is being fetched by the JavaScript file? Is there a way to bypass this?
Thanks!
I was hoping to scrape the entire HTML code of the page.
Solution
The problem is that the Google Cloud page is rendered dynamically in the browser by JavaScript (note the empty app-root placeholder in your output), so the raw HTML returned by requests contains almost none of the content you see.
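You can confirm this from the response in the question itself: the body contains only an empty app-root element and a script tag, so there is no text for BeautifulSoup to find. A quick sketch:

```python
from bs4 import BeautifulSoup

# The HTML that requests.get actually returned (copied from the question above)
raw = ('<body><app-root></app-root>'
       '<script src="//www.gstatic.com/alkali/'
       '929b6f9d67240a61d7eef83db9b5065b01601907.js"></script></body>')

soup = BeautifulSoup(raw, "html.parser")
print(soup.find("app-root").contents)     # the placeholder has no children
print(repr(soup.get_text(strip=True)))    # and there is no visible text at all
```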
I would suggest using Selenium or Playwright to load the page in a real browser and then grab the HTML source, which will be fully rendered and match what you see in the browser's inspection tool.
Here is the complete code that would do what you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://cloud.google.com/find-a-partner/partner/quantiphi-inc"

driver = webdriver.Chrome()
driver.get(url)

# 1. get the rendered HTML
selector = '*[aria-label="Partner email address"]'

# 1.a. wait until the element is rendered
wait = WebDriverWait(driver, 10)
wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
)

# 1.b. get the HTML
html = driver.page_source

# 2. parse it with BeautifulSoup
with open("output.html", "w") as f:
    f.write(html)

soup = BeautifulSoup(html, "html.parser")
email_elements = soup.select(selector)
for email_element in email_elements:
    print(email_element.text.strip())

driver.quit()
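The parsing step on its own can be checked without a browser. Given a fragment shaped like the rendered page (the mailto link and email value here are made up for illustration; only the aria-label matches the real selector), soup.select pulls the text out:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the rendered partner page
rendered = """
<div>
  <a aria-label="Partner email address" href="mailto:partners@example.com">
    partners@example.com
  </a>
</div>
"""

soup = BeautifulSoup(rendered, "html.parser")
emails = [el.text.strip()
          for el in soup.select('*[aria-label="Partner email address"]')]
print(emails)  # ['partners@example.com']
```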
Answered By - Yubo