Wednesday, November 2, 2022

[FIXED] with bs4 check if a class in another class exists or not and save the results accordingly in a list

Labels: beautifulsoup, python, web-scraping

Issue

I need to scrape information with bs4 and store it in a list. The HTML page looks like this:

<div class="article-container">
    <div class="row">   
        <span class="color-primary">
            Title
        </span>
    </div>
    <div class="row">   
        <span class="color-secondary">
            Author Name
        </span>
    </div>
</div>

<div class="article-container">
    <div class="row">   
        <span class="color-primary">
            Title
        </span>
    </div>
</div>

For some articles the span with the author's class is missing. This is how I'm currently trying to get the information:

article_author = []

# Collect every article container; find_all already returns a list-like result
article_html_list = soup.find_all("div", attrs={"class": "article-container"})


for html in article_html_list:
    if '<span class="color-secondary">' in str(html):
        author = str(html).split('<span class="color-secondary">')
        author = str(author[1]).rsplit('</span>')
        article_author.append(author[0].strip())
    else:
        article_author.append("None")

Is there a better way to check whether a class nested inside another class is missing, and to save the results in a list?

Solution

Use your BeautifulSoup object directly and check whether the element you are looking for exists (here e is one article-container element):

author.get_text(strip=True) if (author := e.find('span', attrs={'class':'color-secondary'})) else None

Note: the walrus operator (:=) requires Python 3.8 or later.

Alternative without the walrus operator:

e.find('span', attrs={'class':'color-secondary'}).get_text(strip=True) if e.find('span', attrs={'class':'color-secondary'}) else None
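The one-liner without the walrus operator calls find() twice for the same element. A minimal sketch of one way to avoid that on older Python versions is to wrap the lookup in a small helper. The name text_or_none and the one-article HTML snippet are made up for illustration and are not part of the original answer:

```python
from bs4 import BeautifulSoup


def text_or_none(parent, css_class):
    """Return the stripped text of the first <span> with css_class inside parent, or None."""
    node = parent.find('span', attrs={'class': css_class})
    # find() returns None when nothing matches, so test for that explicitly
    return node.get_text(strip=True) if node is not None else None


# Hypothetical one-article document: it has a title span but no author span
html = '<div class="article-container"><span class="color-primary"> Title </span></div>'
e = BeautifulSoup(html, 'html.parser').find('div', attrs={'class': 'article-container'})

title = text_or_none(e, 'color-primary')
author = text_or_none(e, 'color-secondary')
```

With this snippet, title holds the stripped title text and author is None, because the author span is absent.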

Example

Instead of a separate list for every attribute, this example uses a single list holding one dict per article, which keeps the results structured:

from bs4 import BeautifulSoup
html='''
<div class="article-container">
    <div class="row">   
        <span class="color-primary">
            Title
        </span>
    </div>
    <div class="row">   
        <span class="color-secondary">
            Author Name
        </span>
    </div>
</div>

<div class="article-container">
    <div class="row">   
        <span class="color-primary">
            Title
        </span>
    </div>
</div>
'''

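# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')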

data = []
for e in soup.find_all('div', attrs={'class':'article-container'}):
    data.append({
        'title': e.span.get_text(strip=True),
        'author': author.get_text(strip=True) if (author := e.find('span', attrs={'class':'color-secondary'})) else None
    })

data

Output

[{'title': 'Title', 'author': 'Author Name'},
 {'title': 'Title', 'author': None}]
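If a plain list of authors is all that is needed, as in the question, the same check can also be written with CSS selectors. This is only a sketch of an alternative, not the answer's own method: it swaps find_all()/find() for select()/select_one(), and the shortened HTML string is an assumption standing in for the real page:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page: the second article has no author span
html = '''
<div class="article-container">
    <div class="row"><span class="color-primary"> Title </span></div>
    <div class="row"><span class="color-secondary"> Author Name </span></div>
</div>
<div class="article-container">
    <div class="row"><span class="color-primary"> Title </span></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

article_author = []
for e in soup.select('div.article-container'):
    # select_one() returns None when no element matches the selector
    author = e.select_one('span.color-secondary')
    article_author.append(author.get_text(strip=True) if author is not None else None)
```

Storing a real None rather than the string "None" makes it easy to filter out missing authors later, for example with [a for a in article_author if a is not None].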

Answered By - HedgeHog

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
