Issue
In this code I only want to scrape each movie's duration, but it ends up grabbing the release date instead.
import openpyxl as opx
from bs4 import BeautifulSoup
import requests

wb = opx.Workbook()
ws = wb.active
ws.title = "Movies"
header_row = ["Name", "Date", "Rate", "Duration"]
ws.append(header_row)

url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
response = requests.get(url, headers=header)
html_content = response.content
soup = BeautifulSoup(html_content, "html.parser")

movies = soup.find_all(
    "li", class_="ipc-metadata-list-summary-item sc-bca49391-0 eypSaE cli-parent")
for movie in movies:
    name = movie.find("h3", class_="ipc-title__text").text.strip()
    date = movie.find(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    rate = movie.find(
        "span", class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating")
    # Bug: this matches the first metadata span, which is the release
    # date, not the duration.
    duration = movie.find(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    print(duration.text)
# wb.save("sample.xlsx")
Solution
There are multiple ways to do this. One is using find_all, which returns a list of all matches:
...
duration = movie.find_all('span', class_='sc-14dd939d-6 kHVqMR cli-title-metadata-item')[1]
...
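Since indexing with [1] raises an IndexError when a row has fewer metadata spans than expected, a slightly more defensive sketch could look like this (the get_metadata helper is my own naming, not part of the original code):

def get_metadata(movie, index):
    # Collect every metadata span in this row, then index safely.
    items = movie.find_all(
        'span', class_='sc-14dd939d-6 kHVqMR cli-title-metadata-item')
    return items[index].text.strip() if len(items) > index else '-'

date = get_metadata(movie, 0)      # first span: release year
duration = get_metadata(movie, 1)  # second span: runtime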
Another way is using select or select_one.
duration = movie.select_one('span.sc-14dd939d-6.kHVqMR.cli-title-metadata-item:nth-child(2)')
select and select_one use CSS selectors. In the code above, the selector matches the 2nd span inside the movie element whose classes include sc-14dd939d-6, kHVqMR and cli-title-metadata-item. Note that because CSS class selectors match by containment, it selects any span carrying those three classes even if the span has more classes.
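A tiny self-contained demonstration of that matching behaviour (the markup here is made up for illustration):

from bs4 import BeautifulSoup

html = '<div><span class="a b extra">x</span><span class="a b">y</span></div>'
doc = BeautifulSoup(html, 'html.parser')
print(doc.select_one('span.a.b').text)               # 'x' - extra class still matches
print(doc.select_one('span.a.b:nth-child(2)').text)  # 'y' - second span among its siblings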
To check that the element is actually present before taking its text (select_one returns None when nothing matches):
duration = duration.text.strip() if duration else '-'
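Putting it together, a minimal sketch of the corrected loop (IMDb's generated class names such as sc-14dd939d-6 and kHVqMR change over time, so they may need updating against the live page):

for movie in movies:
    name = movie.find("h3", class_="ipc-title__text").text.strip()
    # All metadata spans for this row, in page order: year, duration, ...
    meta = movie.find_all(
        "span", class_="sc-14dd939d-6 kHVqMR cli-title-metadata-item")
    date = meta[0].text.strip() if len(meta) > 0 else '-'
    duration = meta[1].text.strip() if len(meta) > 1 else '-'
    rate_el = movie.find(
        "span", class_="ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating")
    rate = rate_el.text.strip() if rate_el else '-'
    ws.append([name, date, rate, duration])
wb.save("sample.xlsx")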
You can check more in the BeautifulSoup docs. They are really well written, with numerous examples.
Answered By - Reyot