Issue
I am having trouble extracting values from a security advisory. A section of it looks like this:
<ul>
  <li>
    ABC
    <ul>
      <li>
        XYZ
      </li>
      <li>
        PQR
      </li>
When I call find_all on li, iterate over the results, and print each one, I get:
ABCXYZPQR
XYZ
PQR
What I want instead is:
ABC
XYZ
PQR
I understand this happens because the li for ABC is not closed before the nested list starts, so its text includes the whole sublist.
What I am unable to figure out is how to extract just ABC. Splitting the text after converting it to a string is not an option either, because there is no common delimiter to split on.
Solution
You can use a recursive generator function:
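import bs4

def get_li(d):
    # For an <li>, yield only its direct text nodes, skipping nested tags.
    if d.name == 'li':
        yield ''.join(str(i).strip() for i in d.contents if isinstance(i, bs4.NavigableString))
    # Recurse into child tags so nested <li> elements are yielded in turn.
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from get_li(i)

# html is assumed to hold the markup from the question as a string.
source = bs4.BeautifulSoup(html, 'html.parser')
print(list(get_li(source)))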
Output:
['ABC', 'XYZ', 'PQR']
Answered By - Ajax1234
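A non-recursive variant may also work here. The sketch below is not part of the original answer; it assumes the markup from the question is stored in a string named html and uses a hypothetical helper, own_text, which collects only the strings that are direct children of each li by passing recursive=False to find_all.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (the outer <li> and <ul> are left unclosed).
html = """<ul>
<li>
ABC
<ul>
<li>
XYZ
</li>
<li>
PQR
</li>
"""

def own_text(tag):
    """Hypothetical helper: return only the text directly inside `tag`."""
    # recursive=False limits the search to direct children, so text that
    # belongs to nested <li> elements is not included.
    pieces = tag.find_all(string=True, recursive=False)
    return ''.join(p.strip() for p in pieces)

soup = BeautifulSoup(html, 'html.parser')
# find_all('li') still returns every <li>, nested or not, in document order.
labels = [own_text(li) for li in soup.find_all('li')]
print(labels)
```

This keeps the find_all loop from the question and changes only how the text of each li is read, which may be easier to drop into existing code than a generator.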