Issue
I need to scrape information from an HTML page with bs4 and store it in a list. The page looks like this:
<div class="article-container">
<div class="row">
<span class="color-primary">
Title
</span>
</div>
<div class="row">
<span class="color-secondary">
Author Name
</span>
</div>
</div>
<div class="article-container">
<div class="row">
<span class="color-primary">
Title
</span>
</div>
</div>
For some articles the span with the author's class is missing. This is how I'm currently trying to get the information:
article_author = []
# find_all already returns a list-like result, so iterate over it directly
article_html_list = soup.find_all("div", attrs={"class": "article-container"})
for html in article_html_list:
    if '<span class="color-secondary">' in str(html):
        author = str(html).split('<span class="color-secondary">')
        author = str(author[1]).rsplit('</span>')
        article_author.append(author[0].strip())
    else:
        article_author.append("None")
Is there a better way to check whether an element with a given class is missing inside another element, and to save the results in a list?
Solution
Simply use your BeautifulSoup object and check whether the element you are trying to find is available:
author.get_text(strip=True) if (author := e.find('span', attrs={'class':'color-secondary'})) else None
Note: walrus operator requires Python 3.8 or later to work.
Alternative without the walrus operator:
e.find('span', attrs={'class':'color-secondary'}).get_text(strip=True) if e.find('span', attrs={'class':'color-secondary'}) else None
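The alternative above calls find twice for the same element. A minimal sketch of a variant that looks the element up once and also works on Python versions before 3.8 is shown here; the helper name text_or_none and the sample markup are assumptions made for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching descendant, or None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    tag = parent.find(name, class_=class_name)  # find returns None when nothing matches
    return tag.get_text(strip=True) if tag is not None else None


html = """
<div class="article-container">
  <div class="row"><span class="color-primary"> Title A </span></div>
  <div class="row"><span class="color-secondary"> Author Name </span></div>
</div>
<div class="article-container">
  <div class="row"><span class="color-primary"> Title B </span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One entry per article, with None where the author span is missing
authors = [
    text_or_none(e, "span", "color-secondary")
    for e in soup.find_all("div", class_="article-container")
]
print(authors)  # ['Author Name', None]
```

Because every article contributes exactly one entry, the resulting list stays aligned with the articles even when the author is absent.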
Example
Instead of a separate list for every attribute, this example uses a single list with one dict per article, which stores the results in a more structured way:
from bs4 import BeautifulSoup
html='''
<div class="article-container">
<div class="row">
<span class="color-primary">
Title
</span>
</div>
<div class="row">
<span class="color-secondary">
Author Name
</span>
</div>
</div>
<div class="article-container">
<div class="row">
<span class="color-primary">
Title
</span>
</div>
</div>
'''
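# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
for e in soup.find_all('div', attrs={'class':'article-container'}):
    data.append({
        # e.span is the first span in the container, which holds the title
        'title': e.span.get_text(strip=True),
        'author': author.get_text(strip=True) if (author := e.find('span', attrs={'class':'color-secondary'})) else None
    })
data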
Output
[{'title': 'Title', 'author': 'Author Name'},
{'title': 'Title', 'author': None}]
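The same check can also be written with CSS selectors. The following is only a sketch of that variant, assuming the markup from the question; it relies on select_one returning None when nothing matches, and the variable name records is chosen here for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="article-container">
  <div class="row"><span class="color-primary"> Title </span></div>
  <div class="row"><span class="color-secondary"> Author Name </span></div>
</div>
<div class="article-container">
  <div class="row"><span class="color-primary"> Title </span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for e in soup.select("div.article-container"):
    # select_one returns the first match or None, so both fields are checked the same way
    title = e.select_one("span.color-primary")
    author = e.select_one("span.color-secondary")
    records.append({
        "title": title.get_text(strip=True) if title is not None else None,
        "author": author.get_text(strip=True) if author is not None else None,
    })

print(records)
```

Selecting the title by its class instead of taking the first span makes the lookup independent of element order, at the cost of a slightly longer loop body.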
Answered By - HedgeHog