Issue
I need to extract a list of all the genres of any given movie from the movie page on IMDb.
For example:
- Movie page: https://www.imdb.com/title/tt0454848/?ref_=adv_li_i
- List of Genres: [Crime, Drama, Mystery, Thriller]
I tried using Beautiful Soup, but I cannot find the exact class under which the genres are stored.
These are the snippets I tried:
ul = soup.find("ul", {"class": "ipc-metadata-list ipc-metadata-list--dividers-all sc-388740f9-1 IjgYL ipc-metadata-list--base"})
children = ul.findChildren("a", recursive=False)
This raises AttributeError: 'NoneType' object has no attribute 'findChildren'
class_selector = "ipc-inline-list__item"
genre = soup.find_all('li', {'class': class_selector})
list1 = []
for tag in list1:
list1.append(tag.find('a')).text
print(list1)
This returns a list with no entries.
Any help would be great!
Image of the website source code
Solution
Based on your screenshot, you can get the list of genres by using Selenium with bs4 as follows:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("./chromedriver")  # Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://www.imdb.com/title/tt0454848/?ref_=adv_li_i'
driver.get(url)
driver.maximize_window()
time.sleep(5)  # Wait for the page to load

# Parse the page source as the browser sees it
soup = BeautifulSoup(driver.page_source, 'lxml')
# Find the "Genres" label, then step up to the element that contains it
t = soup.select_one('span.ipc-metadata-list-item__label:-soup-contains("Genres")').parent
# Collect the text of each genre entry under that element
genre = [x.get_text() for x in t.select('div[class="ipc-metadata-list-item__content-container"] > ul > li')]
print(genre)
Output:
['Crime', 'Drama', 'Mystery', 'Thriller']
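To try the selector logic without launching a browser, here is a minimal sketch that runs the same lookup against a static HTML string. SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the IMDb markup, and extract_genres is a helper name made up for this illustration. It also returns an empty list when the "Genres" label is missing, which avoids the kind of 'NoneType' AttributeError shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the IMDb markup (an assumption for
# illustration); the real page has more attributes and generated class names.
SAMPLE_HTML = """
<ul class="ipc-metadata-list">
  <li class="ipc-metadata-list__item">
    <span class="ipc-metadata-list-item__label">Genres</span>
    <div class="ipc-metadata-list-item__content-container">
      <ul>
        <li><a href="#">Crime</a></li>
        <li><a href="#">Drama</a></li>
        <li><a href="#">Mystery</a></li>
        <li><a href="#">Thriller</a></li>
      </ul>
    </div>
  </li>
  <li class="ipc-metadata-list__item">
    <span class="ipc-metadata-list-item__label">Runtime</span>
    <div class="ipc-metadata-list-item__content-container">
      <ul><li>2 hours 33 minutes</li></ul>
    </div>
  </li>
</ul>
"""


def extract_genres(html):
    """Return the genre names found in html, or [] if there is no Genres label."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.select_one(
        'span.ipc-metadata-list-item__label:-soup-contains("Genres")'
    )
    if label is None:
        # select_one returns None when nothing matches, so stop here
        # instead of calling a method on None.
        return []
    items = label.parent.select(
        "div.ipc-metadata-list-item__content-container > ul > li"
    )
    return [li.get_text(strip=True) for li in items]


genres = extract_genres(SAMPLE_HTML)
print(genres)
print(extract_genres("<html><body></body></html>"))
```

With Selenium, the same helper could be called as extract_genres(driver.page_source). The class names are taken from the answer above and may change whenever IMDb updates its page.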
Answered By - F.Hoque