Issue
I want to extract only the text that belongs directly to the top-most element of my soup. However, soup.text also returns the text of all the child elements.
I have:
import BeautifulSoup
soup=BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
print soup.text
The output of this is yesno. I want simply 'yes'.
What's the best way of achieving this?
Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.
Solution
In modern (as of 2023-06-17) BeautifulSoup4, given:
from bs4 import BeautifulSoup
node = BeautifulSoup("""
<html>
<div>
<span>A</span>
B
<span>C</span>
D
</div>
</html>""", "html.parser").find('div')
Use the following to get only the text nodes that are direct children of node (BD, plus the whitespace between them):
s = "".join(node.find_all(string=True, recursive=False))
And the following to get the text nodes of all descendants (ABCD, plus the whitespace between them):
s = "".join(node.find_all(string=True, recursive=True))
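Applied to the two inputs from the question, the same call could look like the sketch below. The helper name direct_text is hypothetical, and the choice of "html.parser" is an assumption: other parsers may restructure the markup (for example by adding a <body>), which would change which element holds the text directly.

```python
from bs4 import BeautifulSoup


def direct_text(node):
    """Return the text held directly by `node`, skipping text inside child tags.

    Hypothetical helper for illustration; not part of BeautifulSoup itself.
    """
    # string=True matches text nodes; recursive=False limits the search
    # to the direct children of `node`.
    parts = node.find_all(string=True, recursive=False)
    # Strip the leading/trailing whitespace that indented markup leaves behind.
    return "".join(parts).strip()


for markup in ("<html>yes<b>no</b></html>", "<html><b>no</b>yes</html>"):
    # Assumption: html.parser, which keeps the markup as written.
    soup = BeautifulSoup(markup, "html.parser")
    print(direct_text(soup.html))  # prints "yes" for both inputs
```

The strip() call only trims the ends of the joined string; whitespace between text nodes is kept as-is.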
Answered By - robertspierre