Issue
Some items have a title and some don't. Sample HTML:
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<div class="text">text 2</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
I tried to get every element with class text and, for each one, its h5 title (if any).
find_previous_sibling can get the title, but the last two text elements also report a title that does not belong to them.
I also tried previous_sibling, checking whether it is an h5 or a div and treating an h5 as the title, but it returns nothing.
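from bs4 import BeautifulSoup

# `response` is assumed to come from an earlier HTTP request (not shown)
html = BeautifulSoup(response.text, 'lxml')
content = html.find('div', {'id': 'content'})
paras = content.find_all('div', {'class': 'text'})
for para in paras:
    # Attempt 1: nearest preceding <h5> sibling
    title = para.find_previous_sibling('h5')
    if title:
        print(title.get_text())
    # Attempt 2: the node immediately before this div
    pr = para.previous_sibling
    if pr:
        print(pr)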
Solution
You could use find_previous() without any arguments to get the element immediately before the div, then check .name to see whether it is an <h5>:
from bs4 import BeautifulSoup

html = """
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<div class="text">text 2</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
"""

html = BeautifulSoup(html, 'html.parser')
content = html.find('div', {'id': 'content'})
paras = content.find_all('div', {'class': 'text'})
for para in paras:
    print(para.text)
    prev = para.find_previous()
    if prev and prev.name == 'h5':
        print(prev.text)
Gives:
text 1
Title1
text 2
Title2
text 3
text 4
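A likely reason previous_sibling seemed to return nothing is that the node right before each div is the newline between the tags, not a tag. The sketch below shows this on the same sample HTML and pairs each text with its title using find_previous_sibling() with no arguments, which skips text nodes. The names SAMPLE and pair_titles are hypothetical, introduced only for this illustration.

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<div class="text">text 2</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
"""


def pair_titles(markup):
    """Return (text, title) pairs; title is None when no <h5> directly precedes."""
    soup = BeautifulSoup(markup, 'html.parser')
    content = soup.find('div', {'id': 'content'})
    pairs = []
    for para in content.find_all('div', {'class': 'text'}):
        # With no arguments, find_previous_sibling() skips text nodes and
        # returns the nearest preceding sibling tag (or None).
        prev = para.find_previous_sibling()
        is_title = prev is not None and prev.name == 'h5'
        pairs.append((para.get_text(), prev.get_text() if is_title else None))
    return pairs


soup = BeautifulSoup(SAMPLE, 'html.parser')
first = soup.find('div', {'class': 'text'})
# The raw previous_sibling is the whitespace between the tags, not the <h5>.
print(repr(first.previous_sibling))
print(pair_titles(SAMPLE))
```

Unlike find_previous(), which walks backwards through the whole document and could pick up a tag nested inside an earlier element, this variant only looks at siblings on the same level. Which one fits depends on how the real page is structured.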
Answered By - 0stone0