Issue
I am trying to parse HTML and extract data from it using Python and BeautifulSoup.
The HTML contains several repeated structures like the one below.
Here is a sample snippet:
<div class="indented"><p id="1" class="">
<strong><strong><strong><strong><strong><strong><strong><strong><strong><strong>Regional</strong></strong></strong></strong></strong></strong></strong></strong></strong></strong>
</p>
<ul id="2" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about region 1
</li>
</ul>
<ul id="3" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about region 2</li>
</ul>
<p id="4" class="">
<strong><strong><strong><strong><strong><strong><strong><strong><strong>Country</strong></strong></strong></strong></strong></strong></strong></strong></strong>
</p>
<ul id="5" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 1
<ul id="ac" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1
</li>
</ul>
<ul id="ad" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
</ul>
<ul id="ae" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
</ul>
</li>
</ul>
<ul id="6" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 2
<ul id="ab" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 2
</li>
</ul>
</li>
</ul>
<p id="7" class=""><strong><strong><strong><strong><strong>City</strong></strong></strong></strong></strong></p>
<ul id="8" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about city 1</li>
<ul id="acc" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1
<ul id="add" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
</ul>
</li>
</ul>
<ul id="aee" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
</ul>
</ul>
<ul id="9" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 2
<figure id="a" class="image"><a
href="a.png"><img
style="width:3040px"
src="a.png"></a>
</figure>
</li>
</ul>
</div>
My objective is to get all the nested text under the City <p> tag. You can assume that the City <p> tag will always be the last <p> tag before the <div> it sits in is closed. Under the City <p> tag, bullet points can have multiple layers of nesting and may contain text in different formats (e.g. bold, italic and so on). Any non-text data such as images should be ignored.
Unfortunately, the <ul> elements are siblings of the <p> tag rather than its children, which makes it difficult for me to accomplish this task.
Does anyone have advice on how I can do this programmatically? Thanks.
Solution
IIUC, you can use tag.find_previous to check whether the tag you're in is under City or not. E.g.:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")  # html_text contains your snippet from the question

text = []
for ul in soup.select("ul:not(:has(ul))"):
    p = ul.find_previous("p")
    if p and p.text.strip() == "City":
        text.append(ul.get_text(strip=True, separator=" "))

print("\n".join(text))
Prints:
Lorem ipsum about city 1
Lorem ipsum about country 2
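To see why find_previous helps even though the lists are not children of the <p>: it walks backwards through the document in parse order, so a deeply nested tag still reports the nearest <p> that comes before it. Here is a minimal sketch on a made-up miniature document (the `sample` string and the `headings` dict are just for illustration, they are not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, not the HTML from the question.
sample = """
<div>
  <p><strong>Country</strong></p>
  <ul><li>country item</li></ul>
  <p><strong>City</strong></p>
  <ul><li>city item<ul><li>nested city item</li></ul></li></ul>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

# Map the leading text of each <li> to the nearest <p> that precedes it.
headings = {}
for li in soup.find_all("li"):
    p = li.find_previous("p")            # walks backwards in document order
    first_text = li.find(string=True).strip()
    headings[first_text] = p.get_text(strip=True)

for item, heading in headings.items():
    print(f"{item!r} is under {heading!r}")
```

Both the top-level and the nested city items should resolve to City, which is the property the filter above relies on.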
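EDIT: To keep the nesting levels, you can walk the <ul> tags recursively and indent each item by its depth: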
from bs4 import BeautifulSoup
html_text = """\
<div class="indented"><p id="1" class="">
<strong><strong><strong><strong><strong><strong><strong><strong><strong><strong>Regional</strong></strong></strong></strong></strong></strong></strong></strong></strong></strong>
</p>
<ul id="2" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about region 1
</li>
</ul>
<ul id="3" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about region 2</li>
</ul>
<p id="4" class="">
<strong><strong><strong><strong><strong><strong><strong><strong><strong>Country</strong></strong></strong></strong></strong></strong></strong></strong></strong>
</p>
<ul id="5" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 1
<ul id="ac" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1
</li>
</ul>
<ul id="ad" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
</ul>
<ul id="ae" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
</ul>
</li>
</ul>
<ul id="6" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 2
<ul id="ab" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 2
</li>
</ul>
</li>
</ul>
<p id="7" class=""><strong><strong><strong><strong><strong>City</strong></strong></strong></strong></strong></p>
<ul id="8" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about city 1</li>
<ul id="acc" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1
<ul id="add" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-2</li>
</ul>
</li>
</ul>
<ul id="aee" class="bulleted-list">
<li style="list-style-type:circle">Sub lorem ipsum about country 1-3</li>
</ul>
</ul>
<ul id="9" class="bulleted-list">
<li style="list-style-type:disc">Lorem ipsum about country 2
<figure id="a" class="image"><a
href="a.png"><img
style="width:3040px"
src="a.png"></a>
</figure>
</li>
</ul>
</div>"""
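soup = BeautifulSoup(html_text, "html.parser")


def fn(ul, level=0):
    # look only at the direct <li>/<ul> children of this list
    for tag in ul.find_all(["li", "ul"], recursive=False):
        if tag.name == "li":
            # print only the text that belongs directly to this <li>
            print(
                "\t" * level,
                " ".join(c.strip() for c in tag.find_all(string=True, recursive=False)),
                sep="",
            )
            # lists nested inside the <li> go one level deeper
            for li_ul in tag.find_all("ul", recursive=False):
                fn(li_ul, level + 1)
        else:
            # a <ul> placed directly inside a <ul>
            fn(tag, level + 1)


for ul in soup.select("div > ul"):
    fn(ul)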
Prints:
Lorem ipsum about region 1
Lorem ipsum about region 2
Lorem ipsum about country 1
    Sub lorem ipsum about country 1
    Sub lorem ipsum about country 1-2
    Sub lorem ipsum about country 1-3
Lorem ipsum about country 2
    Sub lorem ipsum about country 2
Lorem ipsum about city 1
    Sub lorem ipsum about country 1
        Sub lorem ipsum about country 1-2
    Sub lorem ipsum about country 1-3
Lorem ipsum about country 2
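If only the City section is needed, one possible variation is to start at the matching <p> and walk its following siblings until the next <p>. The sketch below is only an illustration under a few assumptions: the helper names (`own_text`, `walk`, `section_lines`) and the small `sample` document are made up, inline formatting such as <b> or <em> is kept as text, and <figure>/<img> are skipped.

```python
from bs4 import BeautifulSoup, Tag

# Tags whose content should not count as an item's own text:
# nested lists are walked separately, images are ignored.
SKIP = {"ul", "ol", "figure", "img"}


def own_text(li):
    """Text that belongs to this <li> itself, including inline tags like <b>."""
    parts = []
    for child in li.children:
        if isinstance(child, Tag):
            if child.name in SKIP:
                continue
            text = child.get_text(" ", strip=True)
        else:
            text = child.strip()
        if text:
            parts.append(text)
    return " ".join(parts)


def walk(ul, level=0):
    """Yield (level, text) for every item below ul, depth-first."""
    for child in ul.find_all(["li", "ul"], recursive=False):
        if child.name == "li":
            text = own_text(child)
            if text:
                yield level, text
            for nested in child.find_all("ul", recursive=False):
                yield from walk(nested, level + 1)
        else:
            # a <ul> sitting directly inside a <ul>
            yield from walk(child, level + 1)


def section_lines(soup, heading):
    """Collect (level, text) pairs for the lists that follow the given <p>."""
    p = next((t for t in soup.find_all("p") if t.get_text(strip=True) == heading), None)
    if p is None:
        return []
    lines = []
    for sibling in p.find_next_siblings(["p", "ul"]):
        if sibling.name == "p":   # the next section starts here
            break
        lines.extend(walk(sibling))
    return lines


# Hypothetical miniature document, not the HTML from the question.
sample = """
<div>
<p><strong>Country</strong></p>
<ul><li>country item</li></ul>
<p><strong>City</strong></p>
<ul>
<li>city <b>one</b></li>
<ul><li>sub item <em>a</em></li></ul>
</ul>
<ul><li>city two<figure><img src="a.png"></figure></li></ul>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
for level, text in section_lines(soup, "City"):
    print("\t" * level + text)
```

Joining the pieces with a single space keeps bold or italic fragments readable, but it can add a space before punctuation that directly follows an inline tag, so adjust the separator if that matters for your data.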
Answered By - Andrej Kesely