Issue
Sample content:
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<h6>SubTitle</h6>
<otherTag>bla bla</otherTag>
<div class="text">text 2</div>
<div class='pi'>post item</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
Inside id content, I need to get each class text element, and then get the <h5>, <h6>, <otherTag>, and <div class='pi'> elements that belong to that class text element.
My approach is to get each class text element, then collect the elements above it with find_all_previous until I meet another class text element or reach the top of id content. The problem is that find_all_previous returns all of the previous content. How can I make it stop searching at the previous class text element or at id content? I also don't think this method is a good idea, since every search returns all of the preceding content.
Using find_previous is not a good choice either: it has to detect the elements one by one, and the elements are in no fixed order, and some may even be absent.
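from bs4 import BeautifulSoup

# response is the HTTP response fetched earlier (not shown here)
html = BeautifulSoup(response.text, 'lxml')
content = html.find('div', {'id': 'content'})
paras = content.find_all('div', {'class': 'text'})
for para in paras:
    print(para.get_text())
    # Returns every preceding element in the document, not just this group
    all_prevs = para.find_all_previous()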
Edited
The result should be grouped by class text, for example:
Title2, SubTitle, bla bla, text 2
Solution
First, select all .text elements:
# dom is the parsed BeautifulSoup document
text_elements = dom.select('.text')
Then use itertools.takewhile() on .find_previous_siblings() to take siblings only until we encounter another .text element. The siblings come back nearest first, so each group is in reverse document order:
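from itertools import takewhile

def is_not_text(element):
    # True for any element that does not carry the "text" class
    return 'text' not in element.attrs.get('class', [])

other_elements = [
    # Passing True matches every tag and skips bare text nodes
    [*takewhile(is_not_text, text.find_previous_siblings(True))]
    for text in text_elements
]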
'''
[
    [<h5>Title1</h5>],
    [
        <othertag>bla bla</othertag>,
        <h6>SubTitle</h6>,
        <h5>Title2</h5>
    ],
    [<div class="pi">post item</div>],
    []
]
'''
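To get the grouped output the question asks for, the two steps above can be combined and each group reversed back into document order. The following is only a sketch under a few assumptions: the helper name group_by_text is made up for illustration, the markup is the sample from the question, and html.parser is swapped in for lxml purely to avoid the extra dependency.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Sample markup copied from the question.
SAMPLE = """
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<h6>SubTitle</h6>
<otherTag>bla bla</otherTag>
<div class="text">text 2</div>
<div class='pi'>post item</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
"""


def is_not_text(element):
    # True for any element that does not carry the "text" class
    return 'text' not in element.attrs.get('class', [])


def group_by_text(soup):
    """Hypothetical helper: one list of strings per .text element."""
    groups = []
    for text in soup.select('#content > .text'):
        # Siblings arrive nearest first; stop at the previous .text element
        preceding = list(takewhile(is_not_text, text.find_previous_siblings(True)))
        # Restore document order before extracting the text
        preceding.reverse()
        groups.append([el.get_text() for el in preceding] + [text.get_text()])
    return groups


# html.parser is an assumption made here; the question uses lxml
soup = BeautifulSoup(SAMPLE, 'html.parser')
groups = group_by_text(soup)
for group in groups:
    print(', '.join(group))
```

With the sample markup, the second line printed is Title2, SubTitle, bla bla, text 2, which matches the grouping requested in the question.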
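You can also get the results as a dict, keyed by each .text element, if you prefer. Keep in mind that two tags with identical markup compare equal, so they would share a single key: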
other_elements = {
    text: [*takewhile(is_not_text, text.find_previous_siblings(True))]
    for text in text_elements
}
'''
{
    <div class="text">text 1</div>: [
        <h5>Title1</h5>
    ],
    <div class="text">text 2</div>: [
        <othertag>bla bla</othertag>,
        <h6>SubTitle</h6>,
        <h5>Title2</h5>
    ],
    <div class="text">text 3</div>: [
        <div class="pi">post item</div>
    ],
    <div class="text">text 4</div>: []
}
'''
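Since the question worries about every search rescanning earlier content, a different technique may also be worth considering: a single forward pass over the direct children of id content, closing a group each time a .text element is reached. This is not the approach in the answer above, just a rough sketch; the function name group_in_one_pass is hypothetical, html.parser is assumed instead of lxml, and any elements that follow the last .text element are simply dropped.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
SAMPLE = """
<div id="content">
<h5>Title1</h5>
<div class="text">text 1</div>
<h5>Title2</h5>
<h6>SubTitle</h6>
<otherTag>bla bla</otherTag>
<div class="text">text 2</div>
<div class='pi'>post item</div>
<div class="text">text 3</div>
<div class="text">text 4</div>
</div>
"""


def group_in_one_pass(container):
    """Hypothetical helper: visit each direct child tag exactly once."""
    groups = []
    pending = []
    # recursive=False limits the walk to direct children of the container
    for child in container.find_all(True, recursive=False):
        pending.append(child.get_text())
        if 'text' in child.get('class', []):
            # A .text element closes the current group
            groups.append(pending)
            pending = []
    # Anything left in pending had no following .text element and is ignored
    return groups


# html.parser is an assumption made here; the question uses lxml
soup = BeautifulSoup(SAMPLE, 'html.parser')
groups_one_pass = group_in_one_pass(soup.find(id='content'))
for group in groups_one_pass:
    print(', '.join(group))
```

Each child is visited once and the groups come out in document order, so no reversal is needed.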
Answered By - InSync