Monday, August 1, 2022

[FIXED] Scrape information with BeautifulSoup 4 when there is not a "class"

August 01, 2022 beautifulsoup, python No comments

Issue

I'm trying to scrape a page that looks like this:

<article>
 <h1>
   <a href="site.com/?ref=search" title="Read">Title</a>
 </h1>
 <p> Summury </p>
 <em class="author">Name</em>                
 <aside class="correlati"> 
    <a href="site.com?ref=search" target="_blank">Site.com</a>                                     
      <a href="site.com/2022/01/01">
    <time datetime="2022-01-01">01 January 2022</time>
      </a>                             
 </aside>              
</article>

This is my code; I'm using bs4:

data = soup.find_all("article", attrs={"class":None})

It kind of works, but since the elements have no class, I'm getting all the information back as one long string, and I'm having trouble working out how to store each piece of information in its own variable. I'd like to create lists that contain all the <h1> titles, all the <p> summaries, all the <time datetime> dates, and so on. How can I do that? Thank you all!

Solution

Maybe something like this:

from bs4 import BeautifulSoup
page = '''
<article>
 <h1>
   <a href="site.com/?ref=search" title="Read">Title</a>
 </h1>
 <p> Summury </p>
 <em class="author">Name</em>                
 <aside class="correlati"> 
    <a href="site.com?ref=search" target="_blank">Site.com</a>                                     
      <a href="site.com/2022/01/01">
    <time datetime="2022-01-01">01 January 2022</time>
      </a>                             
 </aside>              
</article>
<article>
 <h1>
   <a href="site.com/?ref=search" title="Read">Title2</a>
 </h1>
 <p> Summury 2 </p>              
 <aside class="correlati"> 
    <a href="site.com?ref=search" target="_blank">Site.com</a>                                     
      <a href="site.com/2022/01/01">
    <time datetime="2022-01-01">02 January 2022</time>
      </a>                             
 </aside>              
</article>
'''

soup = BeautifulSoup(page, 'html.parser')
articles = soup.find_all('article')

# this creates a separate list for each field (titles, summaries, times, authors)
titles = []
summaries = []
times = []
authors = []
for article in articles:
    # title = article.find('h1').text
    title = article.h1.a.text.strip()
    titles.append(title)
    # summary = article.find('p').text
    summary = article.p.text.strip()
    summaries.append(summary)
    try:
        author = article.em.text.strip()
        authors.append(author)
    except AttributeError:
        author = 'Unknown'
        authors.append(author)
    # time = article.find('time').text
    time = article.time.text.strip()
    times.append(time)

print(titles, summaries, authors, times)

# this creates one flat list holding every field of every article
# (named "items" so the built-in list type is not shadowed)
items = []
for article in articles:
    # title = article.find('h1').text
    title = article.h1.a.text.strip()
    items.append(title)
    # summary = article.find('p').text
    summary = article.p.text.strip()
    items.append(summary)
    try:
        author = article.em.text.strip()
        items.append(author)
    except AttributeError:
        # article.em is None when the article has no <em> tag
        author = 'Unknown'
        items.append(author)
    # time = article.find('time').text
    time = article.time.text.strip()
    items.append(time)

print(items)
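
The question also asks for the <time datetime> value, which lives in the tag's attribute rather than in its text. As a rough sketch of one alternative (not the original answer's approach), the fields of each article could be kept together in a dict, with missing tags handled by an explicit None check instead of try/except. The helper names text_or_default and parse_articles are hypothetical, and the sample markup is a trimmed copy of the question's snippet with the second date changed for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question's snippet;
# the second article deliberately has no <em class="author">.
html = """
<article>
 <h1><a href="site.com/?ref=search" title="Read">Title</a></h1>
 <p> Summary </p>
 <em class="author">Name</em>
 <aside class="correlati">
   <a href="site.com/2022/01/01"><time datetime="2022-01-01">01 January 2022</time></a>
 </aside>
</article>
<article>
 <h1><a href="site.com/?ref=search" title="Read">Title2</a></h1>
 <p> Summary 2 </p>
 <aside class="correlati">
   <a href="site.com/2022/01/02"><time datetime="2022-01-02">02 January 2022</time></a>
 </aside>
</article>
"""


def text_or_default(tag, default="Unknown"):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_articles(markup):
    """Build one dict per <article>, so related fields stay together."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for article in soup.find_all("article"):
        time_tag = article.find("time")
        records.append({
            "title": text_or_default(article.find("h1")),
            "summary": text_or_default(article.find("p")),
            "author": text_or_default(article.find("em", class_="author")),
            # .get() reads the attribute value and returns None if it is absent
            "date": time_tag.get("datetime") if time_tag is not None else None,
        })
    return records


records = parse_articles(html)
for record in records:
    print(record)
```

Keeping one record per article avoids the parallel lists drifting out of step when a field is missing, and a column can still be pulled out later with something like [r["title"] for r in records].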

Let me know if this suits your needs ;)

Answered By - Edoardo Balducci

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
