Wednesday, April 6, 2022

[FIXED] Parsing invalid HTML and retrieving a tag's text to replace it

Tags: beautifulsoup, html, python, python-3.x

Issue

I need to iterate over invalid HTML and get the text of every tag so that I can change it.

from bs4 import BeautifulSoup

html_doc = """
<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for tag in soup.find_all():
    print(tag.name)
    if tag.string:
        tag.string.replace_with("1")

print(soup)

The result is

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
<div class="oxy-expand-collapse-icon" href="#"></div>
<div class="oxy-toggle-content">
<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3> </div>
</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>1</strong><br/>
Otevřeno: <strong>1</strong>, denně</p>
</span></div>

I know how to replace the text, but BeautifulSoup does not find the text that sits directly inside the paragraph tag. The strings "Začátek sklizně:", "Otevřeno:" and ", denně" are never found, so I cannot replace them.

I tried different parsers, such as lxml and html5lib, but that made no difference. I also tried Python's built-in HTML library, but it only supports iterating over HTML, not changing it.

Solution

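On a Tag object, .string returns a NavigableString only when the tag has exactly one child. If that child is a string, the string is returned; if it is another tag, that tag's .string is used. If the tag has no children or more than one child, .string is None. That is why the <p> element, which mixes text with <strong> and <br> tags, is skipped by the loop in the question.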
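
The difference can be seen on a reduced fragment. The following is only a minimal sketch: the English sample markup and the variable names are made up for illustration and are not part of the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with the same shape as the <p> in the question
soup = BeautifulSoup(
    "<p>Start: <strong>Open</strong>, daily</p>", "html.parser"
)

strong = soup.strong  # exactly one child, and it is a string
p = soup.p            # three children: string, <strong>, string

single = strong.string    # the NavigableString "Open"
multiple = p.string       # None, because <p> has more than one child
pieces = list(p.strings)  # every string below <p>, in document order

print(repr(single), multiple, pieces)
```

Here .strings is used only to show that the text is still in the tree; it is .string alone that gives up when a tag has several children.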

The scenario is not quite clear to me, but here is one more approach based on your comment:

I need generic code to iterate any html and find all texts so I can work with them.

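# string=True matches every text node; older bs4 versions spell this argument text=True
for text_node in soup.find_all(string=True):
    text_node.replace_with('1')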

Example

from bs4 import BeautifulSoup

html_doc = """<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">
   <div class="oxy-expand-collapse-icon" href="#"></div>
   <div class="oxy-toggle-content">
    <h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">Sklizeň jahod 2019</span></h3>   </div>
  </div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>Začátek sklizně: <strong>Zahájeno</strong><br>
Otevřeno: <strong>6 h – do otrhání</strong>, denně</p>
</span></div>"""

soup = BeautifulSoup(html_doc, 'html.parser')

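# string=True (text=True in older bs4 versions) matches every text node,
# including the whitespace between tags
for text_node in soup.find_all(string=True):
    text_node.replace_with('1')

print(soup)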

Output

<div class="oxy-toggle toggle-7042 toggle-7042-expanded" data-oxy-toggle-active-class="toggle-7042-expanded" data-oxy-toggle-initial-state="closed" id="_toggle-212-142">1<div class="oxy-expand-collapse-icon" href="#"></div>1<div class="oxy-toggle-content">1<h3 class="ct-headline" id="headline-213-142"><span class="ct-span" id="span-225-142">1</span></h3>1</div>1</div><div class="ct-text-block" id="text_block-214-142"><span class="ct-span" id="span-230-142"><p>1<strong>1</strong><br/>1<strong>1</strong>1</p>1</span></div>

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
