Issue
I need to iterate over (possibly invalid) HTML and get the text of every tag so that I can change it.
from bs4 import BeautifulSoup
html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")
print(soup)
The result is
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>
I know how to replace the text, but BeautifulSoup does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.
I tried different parsers, but lxml and html5lib make no difference. I also tried Python's HTML library, but it only supports iterating over HTML, not changing it.
Solution
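.string, called on a Tag object, returns a NavigableString object only in specific cases: if the tag has a single child that is a string, the returned value is that string (the same applies if the single child is a tag that itself has a .string). If the tag has no children or more than one child, it returns None. That is why the text directly inside <p> is skipped: <p> contains several children (text, <strong>, <br>), so its .string is None.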
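To make the behaviour of .string concrete, here is a minimal sketch on a shortened, made-up paragraph (the HTML snippet and the variable name direct_texts are illustrative assumptions, not part of the question):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Start: <strong>open</strong>, daily</p>", "html.parser")

# <strong> has exactly one child, a string, so .string returns it
print(soup.strong.string)

# <p> has three children (text, <strong>, text), so .string is None
print(soup.p.string)

# The text nodes are still in the tree; they are direct children of <p>
direct_texts = [str(child) for child in soup.p.children
                if isinstance(child, NavigableString)]
print(direct_texts)
```

So the text is not lost; it just cannot be reached through .string on the parent tag.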
The scenario is not quite clear to me, but here is one last approach based on your comment:
I need generic code to iterate any html and find all texts so I can work with them.
for tag in soup.find_all(text=True):
    tag.replace_with('1')
Example
from bs4 import BeautifulSoup
html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all(text=True) returns every text node, including whitespace between tags
for tag in soup.find_all(text=True):
    tag.replace_with('1')
print(soup)
Output
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>
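As the output shows, the whitespace-only nodes between tags are replaced as well. If that is not wanted, one possible variation is to skip them and apply a real transformation instead of a constant. This is only a sketch: change_text is a hypothetical placeholder, the HTML is a shortened version of the example, and string=True is used as the newer name of the text=True filter (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

html_doc = ('<div>\n<p>Začátek sklizně: <strong>Zahájeno</strong><br>\n'
            'Otevřeno: <strong>6 h</strong>, denně</p>\n</div>')

soup = BeautifulSoup(html_doc, "html.parser")


def change_text(text):
    # Hypothetical transformation; put the real processing here
    return text.upper()


# find_all returns a list, so replacing nodes while looping over it is safe
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only nodes such as newlines between tags
        node.replace_with(change_text(node))

result = str(soup)
print(result)
```

Note that find_all(string=True) also returns comments and similar special strings, so documents containing them may need an extra type check.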
Answered By - HedgeHog