Issue
I would like to retrieve information from Google Arts & Culture using BeautifulSoup.
I have checked many Stack Overflow posts ([1], [2], [3], [4], [5]) and still couldn't retrieve the information.
I would like to get each tile's (picture's) li information, such as the href; however, find_all and select_one return an empty list or None.
Could you help me get the href value below, from the anchor tag with class "e0WtYb HpzMff PJLMUc"?
href="/entity/claude-monet/m01xnj?categoryId=artist"
Below is what I have tried.
import requests
from bs4 import BeautifulSoup
url = 'https://artsandculture.google.com/category/artist?tab=time&date=1850'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.find_all('li', class_='DuHQbc')) # []
print(soup.find_all('a', class_='PJLMUc')) # []
print(soup.find_all('a', class_='e0WtYb HpzMff PJLMUc')) # []
print(soup.select_one('#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a')) # None
for elem in soup.find_all('a', class_=['e0WtYb', 'HpzMff', 'PJLMUc'], href=True):
    print(elem) # others with class 'e0WtYb'
    ...
    # and then something like elem['href']
https://artsandculture.google.com/category/artist?tab=time&date=1850
Copied selector from Chrome
#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a
Solution
The problem is not that you're using BeautifulSoup incorrectly. The webpage that you're requesting appears to be missing its content! I saved html.text to a file for inspection, and the artist tiles were not in it.
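A quick way to run the same check yourself is to look for the class names directly in the raw response text. The sketch below is a minimal, hypothetical helper: the names missing_markers and dump_for_inspection are made up for illustration and are not part of requests or BeautifulSoup, and the sample string is a stand-in for the real html.text.

```python
from pathlib import Path


def missing_markers(html_text, markers):
    """Return the markers (for example class names) that never occur in the raw HTML."""
    return [marker for marker in markers if marker not in html_text]


def dump_for_inspection(html_text, path="page.html"):
    """Write the raw HTML to a file so it can be opened in an editor."""
    target = Path(path)
    target.write_text(html_text, encoding="utf-8")
    return target


# Stand-in for html.text (assumption: the real response has a similar empty shell).
sample = '<div id="tab_time"><ul></ul></div>'
print(missing_markers(sample, ["DuHQbc", "PJLMUc"]))  # ['DuHQbc', 'PJLMUc']
```

With the code from the question, you would call missing_markers(html.text, ['DuHQbc', 'PJLMUc']). If a class name comes back as missing, no BeautifulSoup selector can find it in that response.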
Why does this happen? Because the webpage actually loads its content using JavaScript. When you open the site in your browser, the browser executes the JavaScript, which adds all of the artist squares to the webpage. (You may even notice the brief moment during which the squares aren't there when you first load the site.) On the other hand, requests does NOT execute JavaScript; it just downloads the HTML that the server returns and saves it to a string.
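The effect can be reproduced offline. The sketch below is only an illustration under stated assumptions: RAW_HTML is a made-up page (not the real Google Arts & Culture markup) whose single tile would be created by a script, and it uses Python's standard-library html.parser rather than BeautifulSoup to stay dependency-free. AnchorCollector and static_hrefs are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what requests receives: the <ul> is empty, and
# the tile would only be added by the script when a browser executes it.
RAW_HTML = """
<html><body>
<div id="tab_time"><ul></ul></div>
<script>
  var li = document.createElement('li');
  var a = document.createElement('a');
  a.href = '/entity/claude-monet/m01xnj?categoryId=artist';
  li.appendChild(a);
  document.querySelector('#tab_time ul').appendChild(li);
</script>
</body></html>
"""


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags that exist in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def static_hrefs(html_text):
    """Parse the HTML without running any scripts and return the hrefs found."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


print(static_hrefs(RAW_HTML))  # [] because the script is treated as plain text
```

The parser sees the script body as text and never runs it, so the anchor it would create is never found. BeautifulSoup is also a static parser, which is consistent with the empty results in the question.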
What can you do about it? Unfortunately, this means that scraping the website with requests and BeautifulSoup alone will be really tough, because the tiles are not present as HTML elements in the response. In such cases, I would suggest looking for an alternative source of information or using an API provided by the website.
Answered By - Thomas