Tuesday, October 4, 2022

[FIXED] Beautiful Soup select google image returns empty list

beautifulsoup, python, web-crawler

Issue

I would like to retrieve information from Google Arts & Culture using BeautifulSoup. I have checked many Stack Overflow posts ([1], [2], [3], [4], [5]) and still can't retrieve the information.

I would like to get the information for each tile (picture), which is an li element, such as its href. However, find_all returns an empty list and select_one returns None.

Could you help me get the href value below, which belongs to the anchor tag with class "e0WtYb HpzMff PJLMUc"?

href="/entity/claude-monet/m01xnj?categoryId=artist"

Below is what I have tried.

import requests
from bs4 import BeautifulSoup

url = 'https://artsandculture.google.com/category/artist?tab=time&date=1850'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(soup.find_all('li', class_='DuHQbc'))                 # []
print(soup.find_all('a', class_='PJLMUc'))                  # []
print(soup.find_all('a', class_='e0WtYb HpzMff PJLMUc'))    # []
print(soup.select_one('#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a'))  # None
for elem in soup.find_all('a', class_=['e0WtYb', 'HpzMff', 'PJLMUc'], href=True):
    print(elem)  # prints only other anchors with class 'e0WtYb', not the artist tiles

...
# and then something like elem['href']

Copied selector from Chrome

#tab_time > div > div:nth-child(2) > div > ul > li:nth-child(2) > a

Solution

Unfortunately, the problem is not that you're using BeautifulSoup wrong. The HTML that you're requesting appears to be missing its content! I saved html.text to a file for inspection, and the artist tiles you are looking for are not in it.
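The inspection step can be sketched roughly as follows. The helper name save_and_check and the file name page.html are made up for illustration and are not part of the original answer. A plain substring check is only a rough signal, since a class name could also appear inside a script blob without being a real element.

```python
from pathlib import Path


def save_and_check(html_text: str, path: str, marker: str) -> bool:
    """Write the downloaded HTML to disk and report whether `marker` occurs in it.

    Hypothetical helper: the saved file can be opened in an editor to see
    what the server really returned.
    """
    Path(path).write_text(html_text, encoding="utf-8")
    return marker in html_text


# Assumed usage with the variables from the question (needs network access):
#   html = requests.get(url)
#   save_and_check(html.text, "page.html", "PJLMUc")
```

If the check returns False, no selector can match that class in the downloaded text, however the selector is written.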

Why does this happen? Because the webpage actually loads its content using JavaScript. When you open the site in your browser, the browser executes the JavaScript, which adds all of the artist squares to the webpage. (You may even notice the brief moment during which the squares aren't there when you first load the site.) On the other hand, requests does NOT execute JavaScript. It just downloads the HTML that the server returns and gives it to you as a string, so the squares are never added.
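A small offline sketch may make the difference concrete. Both HTML strings below are invented stand-ins, not the real Google Arts & Culture markup: served_html imitates what requests would receive (an empty container plus a script), and rendered_html imitates the DOM after a browser has run that script. The helper find_hrefs is likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical page as the server sends it: an empty list plus a script
# that would create the link once a browser executes it.
served_html = """
<html><body>
  <ul id="tiles"></ul>
  <script>
    var a = document.createElement("a");
    a.className = "e0WtYb HpzMff PJLMUc";
    a.href = "/entity/claude-monet/m01xnj?categoryId=artist";
    document.getElementById("tiles").appendChild(a);
  </script>
</body></html>
"""

# Hypothetical DOM after a browser has run that script.
rendered_html = """
<html><body>
  <ul id="tiles">
    <li><a class="e0WtYb HpzMff PJLMUc" href="/entity/claude-monet/m01xnj?categoryId=artist"></a></li>
  </ul>
</body></html>
"""


def find_hrefs(html_text):
    """Return the href of every anchor that carries the class 'PJLMUc'."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [a["href"] for a in soup.find_all("a", class_="PJLMUc", href=True)]


served_hrefs = find_hrefs(served_html)      # the script is text, not elements
rendered_hrefs = find_hrefs(rendered_html)  # the anchor exists as an element

print(served_hrefs)
print(rendered_hrefs)
```

The same find_all call finds nothing in the first string and finds the anchor in the second, which suggests the selectors in the question were fine and the input was the problem.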

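What can you do about it? Unfortunately, this means that scraping the website with requests and BeautifulSoup alone will be really tough, because the data you want is not in the HTML they receive. In such cases, I would suggest looking for an alternative source of information or using an API provided by the website, if one exists.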

Answered By - Thomas

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
