Issue
I have some HTML like the snippet below. How can I extract all the text between the content label and the time label? The label that follows content may not be exactly time; it could be something else, such as location.
...
<p><strong> name:</strong></p>
<p> sdfsdf </p>
<p><strong> content:</strong></p>
<p> yangben</p>
<p> dsfs </p>
<p> dfsds </p>
<p> sdfs </p>
<p><strong> time:</strong></p>
<p> 2020-10-10</p>
<p><strong> ll:</strong></p>
<p> 2020-10-10</p>
...
Solution
You can try:
from bs4 import BeautifulSoup

html_text = """\
<p><strong> name:</strong></p>
<p> sdfsdf </p>
<p><strong> content: </strong></p>
<p> yangben</p>
<p> dsfs </p>
<p> dfsds </p>
<p> sdfs </p>
<p><strong> time: </strong></p>
<p> 2020-10-10</p>
<p><strong> ll:</strong></p>
<p> 2020-10-10</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Find the <p> whose <strong> child contains the word "content"
content_start = soup.select_one("p:has(strong:-soup-contains(content))")

all_tags = []
for tag in content_start.find_next_siblings():
    # The nearest preceding <strong> tells us which label this tag belongs to
    prev_strong = tag.find_previous("strong")
    # Keep tags that sit under "content" and are not label paragraphs themselves
    if prev_strong and "content" in prev_strong.text and not tag.strong:
        all_tags.append(tag)

print(all_tags)
Prints:
[<p> yangben</p>, <p> dsfs </p>, <p> dfsds </p>, <p> sdfs </p>]
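Since the question asks for the text rather than the tags, here is a minimal sketch of a variant that returns stripped strings and stops at the first following label, whatever it is called (time, location, and so on). The helper name text_after_label is hypothetical and not part of the original answer, and the sketch assumes that every label is a <p> containing a <strong> and that the values are sibling <p> tags, as in the sample HTML.

```python
from bs4 import BeautifulSoup

html_text = """\
<p><strong> name:</strong></p>
<p> sdfsdf </p>
<p><strong> content:</strong></p>
<p> yangben</p>
<p> dsfs </p>
<p> dfsds </p>
<p> sdfs </p>
<p><strong> time:</strong></p>
<p> 2020-10-10</p>
"""


def text_after_label(html, label):
    """Hypothetical helper: text of each <p> after the `label` paragraph,
    up to (but not including) the next paragraph that holds a <strong>."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the label paragraph: a <p> whose <strong> text contains `label`
    start = soup.find(
        lambda t: t.name == "p" and t.strong is not None and label in t.strong.get_text()
    )
    if start is None:
        return []
    texts = []
    for tag in start.find_next_siblings("p"):
        # Assumption: any <p> with a <strong> is the next label, so stop there
        if tag.find("strong"):
            break
        texts.append(tag.get_text(strip=True))
    return texts


print(text_after_label(html_text, "content"))
```

With the sample HTML above, this prints ['yangben', 'dsfs', 'dfsds', 'sdfs']. Breaking out of the loop at the next label avoids scanning the rest of the document, which may matter on long pages.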
Answered By - Andrej Kesely