Saturday, February 12, 2022

[FIXED] All <p class="blah"> 'under' each H2

beautifulsoup, html, python, siblings

Issue

Firstly: I understand the <p>s are not really 'under' the <h2>s but are siblings here. I just needed to get the idea across in the title.

My sample HTML looks like this:

<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>

I've been trying to use BeautifulSoup to create a listing (yes, it's Markdown, but that's not relevant here) that contains just the 'key' info, something like:

# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!

# Snakes
## Eastern Brown
A very aggressive and venomous snake

How do I get every <p class="wildlife"> element (Grey Kangaroo, for example) and its next sibling, for each of the <h2>s? I've tried this:

for h2 in soup.find_all('h2'):
    print("#### ",h2.text)
    x = h2.find_next_siblings('p', class_='wildlife')
    for item in x:
        print("*",item.text,"*",sep="")
        print(item.find_next_sibling('p').text)
        print("    ")
    print("---")

But it goes too 'deep' on the first <h2> (pulling in the second one's data), and then handles the second <h2>:

####  Animals
*Grey Kangaroo*
A bit about kangaroos
    
*Koala*
These are NOT bears!
    
*Eastern Brown*
A very aggressive and venomous snake
    
---
####  Snakes
*Eastern Brown*
A very aggressive and venomous snake
    
---

Can this be done? Thank you.

Solution

I like dicts for storing structured information that can be reused in later processing.

So I select every <p> with the class wildlife, then for each one call find_previous('h2') and find_next('p'), and store the results in data:

data = {}

for w in soup.select('h2~.wildlife'):
    
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
        
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
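A variation on the loop above, offered only as a sketch and not as part of the original answer. It looks up the heading once per element, uses dict.setdefault to create the list on first sight, and swaps find_next('p') for find_next_sibling('p'). find_next looks at everything that follows in document order, including nested tags, while find_next_sibling stays on the same level. The function name group_wildlife, the shortened sample HTML and the 'html.parser' parser choice are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = """
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
"""

# 'html.parser' ships with Python; the answer itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")


def group_wildlife(soup):
    """Group each .wildlife element and its following <p> under the nearest earlier <h2>."""
    grouped = {}
    for w in soup.select("h2 ~ .wildlife"):
        # Look the heading up once instead of three times.
        heading = w.find_previous("h2").get_text(strip=True)
        # Only same-level siblings are considered for the note.
        note = w.find_next_sibling("p")
        grouped.setdefault(heading, []).append({
            "animal": w.get_text(strip=True),
            "note": note.get_text(strip=True) if note else "",
        })
    return grouped


data = group_wildlife(soup)
```

The resulting data has the same shape as in the answer: heading text mapped to a list of dicts with 'animal' and 'note' keys.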

Now you can process the data in the way you like:

for x in data:
    print('# '+ x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
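If the listing should end up as one Markdown string (to write to a file, say) rather than being printed line by line, a small helper along these lines might do. This is a sketch only: to_markdown is a hypothetical name, and the sample dict is hand-written in the same shape as data so that the block runs on its own.

```python
def to_markdown(data):
    """Render {heading: [{'animal': ..., 'note': ...}, ...]} as Markdown text."""
    sections = []
    for heading, entries in data.items():
        lines = ["# " + heading]
        for entry in entries:
            lines.append("## " + entry["animal"])
            lines.append(entry["note"])
        sections.append("\n".join(lines))
    # One blank line between sections, as in the listing the question asks for.
    return "\n\n".join(sections)


# Hand-written sample in the same shape as `data` (assumed for illustration).
sample = {
    "Animals": [
        {"animal": "Grey Kangaroo", "note": "A bit about kangaroos"},
        {"animal": "Koala", "note": "These are NOT bears!"},
    ],
    "Snakes": [
        {"animal": "Eastern Brown", "note": "A very aggressive and venomous snake"},
    ],
}

print(to_markdown(sample))
```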

Example

from bs4 import BeautifulSoup

html='''
<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''

soup = BeautifulSoup(html, 'lxml')

data = {}

for w in soup.select('h2~.wildlife'):
    
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
        
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
    

for x in data:
    print('# '+ x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')

Output

# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
------------------
# Snakes
## Eastern Brown
A very aggressive and venomous snake
------------------

EDIT

If you would just like to print directly, you can go with:

data = []

for w in soup.select('.wildlife'):
    
    h2 = w.find_previous('h2').text
    
    if h2 not in data:
        data.append(h2)
        print('------------------')
        print('# ' + h2)
        
    print('## ' + w.text)
    print(w.find_next('p').text)
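For comparison, here is a sketch that stays closer to the loop in the question. It is not part of the original answer. The question's attempt goes too deep because find_next_siblings returns every later matching sibling, whether or not another <h2> sits in between. Walking next_siblings and stopping at the next <h2> keeps each section separate. The generator name wildlife_under and the 'html.parser' parser choice are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = """
<h1>Wildlife near me</h1>

<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>

<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
"""

soup = BeautifulSoup(html, "html.parser")


def wildlife_under(h2):
    """Yield (animal, note) pairs from the siblings between h2 and the next <h2>."""
    pending = None  # text of a .wildlife <p> still waiting for its note
    for sib in h2.next_siblings:
        if not isinstance(sib, Tag):
            continue  # skip the whitespace strings between tags
        if sib.name == "h2":
            break  # the next section starts here
        if sib.name != "p":
            continue
        if "wildlife" in sib.get("class", []):
            pending = sib.get_text(strip=True)
        elif pending is not None:
            yield pending, sib.get_text(strip=True)
            pending = None


sections = {
    h2.get_text(strip=True): list(wildlife_under(h2))
    for h2 in soup.find_all("h2")
}

for title, pairs in sections.items():
    print("# " + title)
    for animal, note in pairs:
        print("## " + animal)
        print(note)
```

Because a note is only emitted for the <p> directly following a wildlife entry, the introductory paragraphs and the trailing ad line are skipped.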

Answered By - HedgeHog

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
