Issue
I am trying to get all the <p> elements that come after an <h2>.
I know how to do this when there is only one <p> after the <h2>, but not when there are multiple <p> elements.
Here's an example of the webpage:
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....
I need to get all paragraphs in relation to their headings, e.g. Paragraphs 1 and 2 that are related to Heading Text1.
I'm trying to do this with BeautifulSoup in Python. I have been at it for days and searching online has not helped.
How can this be done?
Solution
You can do this with a dict and .find_previous(): iterate over all <p> elements, find the previous <h2> for each one and use its text as the key in your dict, then append the paragraph's text to that key's list:
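d = {}
for p in soup.select('p'):
    # Nearest <h2> that precedes this paragraph in document order
    h2 = p.find_previous('h2')
    if h2 is None:
        # The paragraph appears before any heading, so skip it
        continue
    # Create the list the first time this heading is seen, then append
    d.setdefault(h2.text, []).append(p.text)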
Example
from bs4 import BeautifulSoup
html = '''
<p>Any Other Paragraph</p>
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''
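# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
d = {}
for p in soup.select('p'):
    # Nearest <h2> that precedes this paragraph in document order
    h2 = p.find_previous('h2')
    if h2 is None:
        # The paragraph appears before any heading, so skip it
        continue
    # Create the list the first time this heading is seen, then append
    d.setdefault(h2.text, []).append(p.text)
# In a REPL or notebook this last line displays the dict
d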
Output
{'Heading Text1': ['Paragraph1', 'Paragraph2'],
'Heading Text2': ['Paragraph3', 'Paragraph4', 'Paragraph5']}
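One thing to keep in mind with a dict keyed by heading text: two <h2> elements with the same text would share one list. As a rough sketch of a different technique (not the approach from the answer above), you could start from each <h2> and walk its following siblings until the next <h2>, collecting the results as a list of pairs. This assumes the <p> elements sit at the same nesting level as their <h2>; the function name paragraphs_by_heading is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<p>Any Other Paragraph</p>
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
'''


def paragraphs_by_heading(markup):
    """Return a list of (heading_text, [paragraph_texts]) pairs.

    Hypothetical helper: assumes each <p> is a sibling of its <h2>.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    sections = []
    for h2 in soup.find_all('h2'):
        texts = []
        # Look only at the <h2> and <p> tags that follow this heading
        # at the same nesting level, in document order
        for sibling in h2.find_next_siblings(['h2', 'p']):
            if sibling.name == 'h2':
                break  # the next heading starts a new section
            texts.append(sibling.get_text())
        sections.append((h2.get_text(), texts))
    return sections


sections = paragraphs_by_heading(html)
print(sections)
```

Because the result is a list of pairs rather than a dict, headings keep their document order and repeated heading texts stay separate. If the paragraphs are nested inside other elements, the .find_previous() approach is likely the better fit.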
Answered By - HedgeHog