Thursday, January 11, 2024

[FIXED] Replacing all text in an HTML document with BeautifulSoup4, while keeping the original DOM structure

Issue

I am trying to replace all text in an HTML document using BeautifulSoup4 in Python, including elements that contain both text and other elements. For instance, I want
<div>text1<strong>text2</strong>text3</div> to become
<div>text1_changed<strong>text2_changed</strong>text3_changed</div>.

I am aware of the thread "Faster way of replacing text in all dom elements?", but it uses JavaScript, so the functions it relies on are not available in Python. I would like to accomplish the same goal in pure Python.

I have code that already works if every tag contains either only tags or only text (the rand_text function returns a random string):

from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()

    for el in elements:
        # .string is None when the tag has more than one child,
        # so only replace when there is a single text node to swap
        if el.string is not None:
            el.string.replace_with(rand_text(el.text))
    return soup

However, this code does not work for the example above: the element's "string" attribute is None because the element contains both text and other elements.
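
To make the behaviour of the "string" attribute concrete, here is a small sketch using the example markup from this question. The second document (a <div> wrapping a single <p>) is an extra illustration, not part of the original post.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>text1<strong>text2</strong>text3</div>",
                     features="html.parser")

# <div> has three children (text, <strong>, text), so .string is None
mixed = soup.div.string

# <strong> has exactly one child, a text node, so .string returns it
leaf = soup.strong.string

# A tag whose only child is another tag borrows that child's .string
nested = BeautifulSoup("<div><p>text</p></div>", features="html.parser")
via_child = nested.div.string
same_node = nested.div.string is nested.p.string

print(mixed, leaf, via_child, same_node)
```

The last case matters for any loop that visits both the outer and the inner tag: both report the same text node.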

I have also tried creating a new element when the "string" attribute is None and then replacing the entire element:

from bs4 import BeautifulSoup as bs

def anonymize2(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            new_el = soup.new_tag(el.name)
            new_el.attrs = el.attrs
            for sub_el in el.contents:
                new_el.append(sub_el)
            new_el.string = replacement
            parent = el.parent
            if parent:
                if new_el not in soup:
                    soup.append(new_el)
                parent.replace_with(new_el)
    return soup

However, this version raises "ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree."
I think I am getting this error because the algorithm has already replaced the parent of the element it is trying to replace.
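
As a minimal sketch of one way this error can arise: replace_with raises it whenever the element it is called on has no parent. The markup below is made up for illustration. Note that the top-level BeautifulSoup object itself has no parent either, so calling replace_with on it fails in the same way.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>x</p></div>", features="html.parser")

p = soup.p
p.extract()  # detach <p> from the tree; p.parent is now None

try:
    p.replace_with(soup.new_tag("span"))
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
print(soup.parent)  # the soup object is the root and has no parent
```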

What logic could I implement to fix this?
Or how could I accomplish my original goal using a different method?

Solution

You can iterate over the contents of each element, check whether each item is a string, and replace only the strings. The child tags stay in place, so the DOM structure is preserved.

from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()

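    for el in elements:
        if el.string:
            # Single text node: replace it directly
            el.string.replace_with(rand_text(el.string))
        else:
            # Mixed content: replace only the direct text children
            # (NavigableString is a subclass of str)
            for sub_el in el.contents:
                if isinstance(sub_el, str):
                    sub_el.replace_with(rand_text(sub_el))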
    return soup

# defined for testing purposes. Replace this with your own logic
def rand_text(text):
    return text + "_changed"



html = "<div>text1<strong>text2</strong>text3</div>"
print(randomize(html))

Outputs:

<div>text1_changed<strong>text2_changed</strong>text3_changed</div>
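
One caveat with the loop above: a tag whose only child is another tag reports that child's text as its own "string", so for markup such as <div><p>text</p></div> the same text node is transformed once for the <div> and again for the <p>. A possible alternative, sketched here as a variation and not as the answer's method, is to walk the text nodes themselves with find_all(string=True). The function name randomize_text_nodes is made up for this sketch, and rand_text is the same test stub as above. Skipping whitespace-only nodes and non-plain nodes such as comments is an assumption about what should be left untouched.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def rand_text(text):
    # Same test stub as in the answer; swap in your own logic
    return text + "_changed"


def randomize_text_nodes(html):
    soup = BeautifulSoup(html, features="html.parser")
    # find_all(string=True) collects every text node at any depth into a
    # list up front, so replacing nodes does not disturb the loop
    for node in soup.find_all(string=True):
        # Exact type check skips subclasses such as Comment or Doctype;
        # strip() skips whitespace-only nodes (both are assumptions)
        if type(node) is NavigableString and node.strip():
            node.replace_with(rand_text(node))
    return soup


print(randomize_text_nodes("<div>text1<strong>text2</strong>text3</div>"))
print(randomize_text_nodes("<div><p>text</p></div>"))
```

Each text node is visited exactly once, so nesting depth does not change the result, and no tags are created or moved.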

Answered By - noah1400

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
