Issue
I have a page like this:
...
<div class="myclass">
<p>
text 1 to keep<span>text 1 to remove</span>and keep this too.
</p>
<p>
text 2 to keep<span>text 2 to remove</span>and keep this too.
</p>
<div>
I.e., I want to remove all <span> tags, along with their text, from every <p> element using bs4 (BeautifulSoup in Python 3).
Currently this is my code:
from bs4 import BeautifulSoup
...
text = ""
for tag in soup.find_all(attrs={"class": "myclass"}):
    text += tag.p.text
Of course, this also returns the text inside the spans...
I read that I should use unwrap() or decompose(), but I really do not understand how to apply them in my use case...
None of the similar Q&As I found helped...
Solution
You can try:
from bs4 import BeautifulSoup
html_text = """\
<div class="myclass">
<p>
text 1 to keep<span>text 1 to remove</span>and keep this too.
</p>
<p>
text 2 to keep<span>text 2 to remove</span>and keep this too.
</p>
<div>"""
soup = BeautifulSoup(html_text, "html.parser")
for span in soup.select("p span"):
    span.replace_with(" ")  # or span.extract() to remove it without leaving a space
soup.smooth()
print(soup.prettify())
Prints:
<div class="myclass">
<p>
text 1 to keep and keep this too.
</p>
<p>
text 2 to keep and keep this too.
</p>
<div>
</div>
</div>
Answered By - Andrej Kesely
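Since the question asks specifically about unwrap() and decompose(), here is a minimal sketch of how the two differ. It is not part of the answer above: the helper name paragraph_texts and its drop_span_text flag are made up for illustration, and the sample HTML closes the outer <div> properly. decompose() destroys the tag together with its contents, while unwrap() removes only the tag and leaves its text in place.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (outer div closed here).
html_text = """\
<div class="myclass">
<p>
text 1 to keep<span>text 1 to remove</span>and keep this too.
</p>
<p>
text 2 to keep<span>text 2 to remove</span>and keep this too.
</p>
</div>"""


def paragraph_texts(html, drop_span_text=True):
    """Hypothetical helper: return the text of each <p> inside .myclass.

    drop_span_text=True  -> span.decompose(): tag and its text are removed
    drop_span_text=False -> span.unwrap():    tag is removed, text is kept
    """
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for container in soup.find_all(attrs={"class": "myclass"}):
        for p in container.find_all("p"):
            # Look the next span up again after each removal, so nested
            # spans are handled without touching already-removed tags.
            span = p.find("span")
            while span is not None:
                if drop_span_text:
                    span.decompose()
                else:
                    span.unwrap()
                span = p.find("span")
            # Join the remaining strings with a space, then collapse
            # newlines and repeated whitespace.
            texts.append(" ".join(p.get_text(" ").split()))
    return texts


print(paragraph_texts(html_text))
print(paragraph_texts(html_text, drop_span_text=False))
```

With decompose() the lists contain only the "to keep" text, which is what the question asks for. With unwrap() the "to remove" text is still there, so unwrap() is probably not the right tool for this case.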