Issue
Firstly: I understand the <p>s are not really 'under' the <h2>s but are siblings here. I just needed to get the idea across in the title.
My sample HTML looks like this:
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
I've been trying to use BeautifulSoup to create a listing (yes, it's Markdown, but that's not relevant here) that contains just the 'key' info, something like:
# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
# Snakes
## Eastern Brown
A very aggressive and venomous snake
How do I get all the <p class="wildlife"> elements (such as <p class="wildlife">Grey Kangaroo</p>) and each one's next sibling, for each of the <h2>s? I've tried this:
for h2 in soup.find_all('h2'):
    print("#### ",h2.text)
    x = h2.find_next_siblings('p', class_='wildlife')
    for item in x:
        print("*",item.text,"*",sep="")
        print(item.find_next_sibling('p').text)
        print(" ")
    print("---")
But it goes too 'deep' on the first <h2> (pulling in the second one's data as well), then does the second <h2>:
#### Animals
*Grey Kangaroo*
A bit about kangaroos
*Koala*
These are NOT bears!
*Eastern Brown*
A very aggressive and venomous snake
---
#### Snakes
*Eastern Brown*
A very aggressive and venomous snake
---
Can this be done? Thank you.
Solution
I like dicts for storing structured information that can be reused in later processing. So I select all <p> elements with the class wildlife, and for each one I use find_previous('h2') to get its heading and find_next('p') to get its note, then store that information in data:
data = {}
for w in soup.select('h2~.wildlife'):
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
Now you can process the data in the way you like:
for x in data:
    print('# ' + x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
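A variation on the grouping step, offered only as a sketch and not as part of the original answer: collections.defaultdict removes the "is the key already there" check, and the heading is looked up once per element. It also guards against a wildlife paragraph with no following <p>, where find_next('p') returns None. The html string is a shortened copy of the sample, cut so that the last entry has no note, and the 'html.parser' choice is an assumption made so the snippet runs without lxml installed.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened copy of the sample; the last entry deliberately has no note.
html = """
<h2>Animals</h2>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p class="wildlife">Eastern Brown</p>
"""

# 'html.parser' is an assumption here; the answer itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")

data = defaultdict(list)
for w in soup.select("h2 ~ .wildlife"):
    heading = w.find_previous("h2").get_text(strip=True)
    note = w.find_next("p")  # None when no <p> follows
    data[heading].append({
        "animal": w.get_text(strip=True),
        "note": note.get_text(strip=True) if note is not None else "",
    })

print(dict(data))
```

Converting with dict(data) before printing or serialising gives a plain dictionary with the same contents.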
Example
from bs4 import BeautifulSoup
html='''
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
'''
soup = BeautifulSoup(html, 'lxml')
data = {}
for w in soup.select('h2~.wildlife'):
    if w.find_previous('h2').text not in data:
        data[w.find_previous('h2').text] = []
    data[w.find_previous('h2').text].append({
        'animal' : w.text,
        'note' : w.find_next('p').text
    })
for x in data:
    print('# ' + x)
    for a in data[x]:
        print('## ' + a['animal'])
        print(a['note'])
    print('------------------')
Output
# Animals
## Grey Kangaroo
A bit about kangaroos
## Koala
These are NOT bears!
------------------
# Snakes
## Eastern Brown
A very aggressive and venomous snake
------------------
EDIT
If you would rather print directly, you can go with:
data = []
for w in soup.select('.wildlife'):
    h2 = w.find_previous('h2').text
    if h2 not in data:
        data.append(h2)
        print('------------------')
        print('# ' + h2)
    print('## ' + w.text)
    print(w.find_next('p').text)
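For comparison, here is a sketch that stays closer to the attempt in the question; it is not part of the original answer. That attempt went too deep because find_next_siblings() returns every later sibling, including those that come after the next <h2>. Walking next_siblings and stopping at the next <h2> keeps each section separate. The 'html.parser' choice and the lines list are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = """
<h1>Wildlife near me</h1>
<h2>Animals</h2>
<p>Here are some animals.</p>
<p class="wildlife">Grey Kangaroo</p>
<p>A bit about kangaroos</p>
<p class="wildlife">Koala</p>
<p>These are NOT bears!</p>
<h2>Snakes</h2>
<p>These can be very dangerous! Always take care</p>
<p class="wildlife">Eastern Brown</p>
<p>A very aggressive and venomous snake</p>
<p>A link here to an ad, so we don't want this bit</p>
"""

# 'html.parser' is an assumption so the sketch runs without lxml.
soup = BeautifulSoup(html, "html.parser")

lines = []
for h2 in soup.find_all("h2"):
    lines.append("# " + h2.get_text(strip=True))
    for sib in h2.next_siblings:
        if not isinstance(sib, Tag):
            continue  # skip the whitespace text nodes between tags
        if sib.name == "h2":
            break  # the next section starts here, so stop
        if "wildlife" in (sib.get("class") or []):
            lines.append("## " + sib.get_text(strip=True))
            note = sib.find_next_sibling("p")
            if note is not None:
                lines.append(note.get_text(strip=True))

print("\n".join(lines))
```

Collecting the lines in a list rather than printing as you go makes the result easy to write to a file or check in a test.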
Answered By - HedgeHog