Issue
I'm trying to scrape a page that looks like this:
<article>
<h1>
<a href="site.com/?ref=search" title="Read">Title</a>
</h1>
<p> Summury </p>
<em class="author">Name</em>
<aside class="correlati">
<a href="site.com?ref=search" target="_blank">Site.com</a>
<a href="site.com/2022/01/01">
<time datetime="2022-01-01">01 January 2022</time>
</a>
</aside>
</article>
This is my code (I'm using bs4):
data = soup.find_all("article", attrs={"class":None})
It kind of works, but since the tags have no class, I get all the information back as one long string, and I'm having trouble working out how to store each piece of information in its own variable. I'd like to create lists that contain all the <h1> titles, all the <p> summaries, all the <time datetime> dates, and so on. How can I do that?
Thank you all!
Solution
Maybe something like this:
from bs4 import BeautifulSoup
page = '''
<article>
<h1>
<a href="site.com/?ref=search" title="Read">Title</a>
</h1>
<p> Summury </p>
<em class="author">Name</em>
<aside class="correlati">
<a href="site.com?ref=search" target="_blank">Site.com</a>
<a href="site.com/2022/01/01">
<time datetime="2022-01-01">01 January 2022</time>
</a>
</aside>
</article>
<article>
<h1>
<a href="site.com/?ref=search" title="Read">Title2</a>
</h1>
<p> Summury 2 </p>
<aside class="correlati">
<a href="site.com?ref=search" target="_blank">Site.com</a>
<a href="site.com/2022/01/01">
<time datetime="2022-01-01">02 January 2022</time>
</a>
</aside>
</article>
'''
soup = BeautifulSoup(page, 'html.parser')
articles = soup.find_all('article')
# this creates a separate list for each tag element
titles = []
summaries = []
times = []
authors = []
for article in articles:
    # title = article.find('h1').text
    title = article.h1.a.text.strip()
    titles.append(title)
    # summary = article.find('p').text
    summary = article.p.text.strip()
    summaries.append(summary)
    # the second article has no <em> tag, so article.em is None there
    try:
        author = article.em.text.strip()
    except AttributeError:
        author = 'Unknown'
    authors.append(author)
    # time = article.find('time').text
    time = article.time.text.strip()
    times.append(time)
print(titles, summaries, authors, times)
# this creates one flat list holding every value, article by article
# (named items rather than list, so the built-in list type is not shadowed)
items = []
for article in articles:
    # title = article.find('h1').text
    title = article.h1.a.text.strip()
    items.append(title)
    # summary = article.find('p').text
    summary = article.p.text.strip()
    items.append(summary)
    # fall back to 'Unknown' when the article has no <em> tag
    try:
        author = article.em.text.strip()
    except AttributeError:
        author = 'Unknown'
    items.append(author)
    # time = article.find('time').text
    time = article.time.text.strip()
    items.append(time)
print(items)
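If it helps, the same idea can also be sketched with one dictionary per article, reading the date from the datetime attribute of <time> rather than from its visible text, since the question asks for the <time datetime> value. This is only a sketch under a few assumptions: text_or_default is a helper name made up for this example, the markup is a shortened copy of the sample above, and a missing tag is mapped to 'Unknown' (or None for the date).

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
page = '''
<article>
<h1><a href="site.com/?ref=search" title="Read">Title</a></h1>
<p> Summury </p>
<em class="author">Name</em>
<aside class="correlati">
<a href="site.com/2022/01/01"><time datetime="2022-01-01">01 January 2022</time></a>
</aside>
</article>
<article>
<h1><a href="site.com/?ref=search" title="Read">Title2</a></h1>
<p> Summury 2 </p>
<aside class="correlati">
<a href="site.com/2022/01/01"><time datetime="2022-01-01">02 January 2022</time></a>
</aside>
</article>
'''


def text_or_default(tag, default='Unknown'):
    """Hypothetical helper: stripped text of a tag, or a default if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(page, 'html.parser')

records = []
for article in soup.find_all('article'):
    time_tag = article.find('time')
    records.append({
        'title': text_or_default(article.find('h1')),
        'summary': text_or_default(article.find('p')),
        'author': text_or_default(article.find('em', class_='author')),
        # Tag.get() returns the attribute value, or None if the attribute is absent
        'date': time_tag.get('datetime') if time_tag is not None else None,
    })

# A single column can still be pulled out as a plain list when needed.
dates = [record['date'] for record in records]

print(records)
print(dates)
```

Keeping each article's values together in one dictionary avoids the parallel lists drifting out of step when a tag is missing, and a list such as dates can still be derived from the records afterwards.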
Let me know if this suits your needs ;)
Answered By - Edoardo Balducci