Issue
My HTML code contains nested lists like this:
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
    <li>Cherry</li>
    <li>Orange</li>
    <ul>
      <li>Pineapple</li>
    </ul>
  </ul>
  <li>Banana</li>
</ul>
I need to parse them so they look like this:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
I tried using BeautifulSoup, but I am stuck on how to account for the nesting in my code.
Here is my attempt, where x contains the HTML code listed above:
import bs4
soup = bs4.BeautifulSoup(x, "html.parser")
for ul in soup.find_all("ul"):
    for li in ul.find_all("li"):
        li.replace_with("+ {}\n".format(li.text))
Solution
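You can use recursion, increasing a depth counter by one for each level of tags (the := assignment expression requires Python 3.8 or later):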
import bs4
from bs4 import BeautifulSoup as soup

s = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
    <li>Cherry</li>
    <li>Orange</li>
    <ul>
      <li>Pineapple</li>
    </ul>
  </ul>
  <li>Banana</li>
</ul>
"""

def indent(d, c=0):
    # Join the non-blank text nodes that are direct children of this tag.
    if (text := ''.join(i for i in d.contents if isinstance(i, bs4.NavigableString) and i.strip())):
        yield f'{"+"*c} {text}'
    # Recurse into every child tag, one level deeper.
    for i in d.contents:
        if not isinstance(i, bs4.NavigableString):
            yield from indent(i, c+1)

print('\n'.join(indent(soup(s, 'html.parser').ul)))
Output:
+ Apple
+ Pear
++ Cherry
++ Orange
+++ Pineapple
+ Banana
Answered By - Ajax1234
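An alternative that is not part of the answer above, offered only as a minimal sketch: instead of recursing, the nesting level of each item can be read off by counting its enclosing <ul> tags with find_parents. The function name flatten_list and its marker parameter are hypothetical names chosen for this illustration.

```python
from bs4 import BeautifulSoup, NavigableString


def flatten_list(html, marker="+"):
    """Return one line per <li>, prefixed by one marker per enclosing <ul>.

    flatten_list and marker are illustrative names, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    # find_all returns the tags in document order.
    for li in soup.find_all("li"):
        # Nesting level = number of <ul> ancestors of this item.
        depth = len(li.find_parents("ul"))
        # Use only the item's own text nodes, so the text of a <ul>
        # nested inside this <li> is not repeated on this line.
        text = "".join(
            c for c in li.contents if isinstance(c, NavigableString)
        ).strip()
        if text:
            lines.append(f"{marker * depth} {text}")
    return lines


html = """
<ul>
  <li>Apple</li>
  <li>Pear</li>
  <ul>
    <li>Cherry</li>
    <li>Orange</li>
    <ul>
      <li>Pineapple</li>
    </ul>
  </ul>
  <li>Banana</li>
</ul>
"""

print("\n".join(flatten_list(html)))
```

Because the depth comes from the <ul> ancestors alone, this sketch should also handle markup where the nested <ul> sits inside an <li> rather than directly inside the parent <ul>.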