Issue
I am trying to replace all text in an HTML document using BeautifulSoup4 in Python, including text inside elements that contain both text and other elements. For instance, I want
<div>text1<strong>text2</strong>text3</div>
to become
<div>text1_changed<strong>text2_changed</strong>text3_changed</div>
I am aware of the thread "Faster way of replacing text in all dom elements?", but it uses JavaScript, so the functions used there are not available in Python. I would like to accomplish the same goal using native Python.
I have code that already works if every tag contains either only tags or only text (the rand_text function returns a random string):
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        if el.string is None:
            pass
        else:
            replacement = rand_text(el.text)
            el.string.replace_with(replacement)
    return soup
However, this code does not work for the example above: the element's "string" attribute is None because the element contains both text and other elements.
I have also tried creating a new element when the "string" attribute is None and then replacing the entire element:
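To make the behaviour of the "string" attribute concrete, here is a minimal sketch using the example markup from above. The variable names are only illustrative; it assumes bs4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>text1<strong>text2</strong>text3</div>", "html.parser")

# A tag with exactly one text child exposes that text through .string
strong_string = soup.strong.string

# A tag with mixed children (text and tags) has no single string, so .string is None
div_string = soup.div.string

# .contents still lists every direct child, text nodes included
div_children = [type(child).__name__ for child in soup.div.contents]

print(strong_string, div_string, div_children)
```

So the text of the div is still reachable, just through its direct children rather than through "string".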
from bs4 import BeautifulSoup as bs

def anonymize2(html):
    soup = bs(html, features='html.parser')
    elements = soup.find_all()
    for el in elements:
        replacement = rand_text(el.text)
        if el.string:
            el.string.replace_with(replacement)
        else:
            new_el = soup.new_tag(el.name)
            new_el.attrs = el.attrs
            for sub_el in el.contents:
                new_el.append(sub_el)
            new_el.string = replacement
            parent = el.parent
            if parent:
                if new_el not in soup:
                    soup.append(new_el)
                parent.replace_with(new_el)
    return soup
However, this version raises "ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree."
I think I am getting this error because the algorithm has already replaced the parent of the element it is trying to replace.
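For reference, this ValueError is what replace_with raises when it is called on a node that has no parent. The sketch below reproduces it on a deliberately detached tag; the markup and names are made up for illustration. Note that the BeautifulSoup object itself also has no parent, so calling replace_with on the parent of a top-level element is one likely trigger in the code above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>a</p></div>", "html.parser")
p = soup.p
p.extract()  # detach the tag: p.parent is now None

try:
    # Replacing a node that is no longer in a tree is not possible
    p.replace_with(soup.new_tag("span"))
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
print(soup.parent)  # the root object has no parent either
```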
What logic could I implement to fix this?
Or how could I accomplish my original goal using a different method?
Solution
You can iterate over the contents of each element, check whether each item is a string, and replace it if it is.
from bs4 import BeautifulSoup as bs

def randomize(html):
    soup = bs(html, features='html.parser')
    for el in soup.find_all():
        # el.string is not used here: it also looks through a single child tag,
        # which would cause nested text to be replaced twice.
        # Iterate over a copy, because replace_with() modifies el.contents.
        for sub_el in list(el.contents):
            # NavigableString subclasses str, so this picks out the direct text children.
            if isinstance(sub_el, str):
                sub_el.replace_with(rand_text(sub_el))
    return soup

# Defined for testing purposes. Replace this with your own logic.
def rand_text(text):
    return text + "_changed"

html = "<div>text1<strong>text2</strong>text3</div>"
print(randomize(html))
Outputs:
<div>text1_changed<strong>text2_changed</strong>text3_changed</div>
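An alternative sketch, not part of the original answer: instead of visiting every tag, you can ask Beautiful Soup for the text nodes directly with find_all(string=True) and replace each one exactly once. This sidesteps a pitfall of checking "string" on every tag, which is that a tag whose only child is another tag reports that child's string, so the same text can be touched twice. The names replace_text_nodes and add_suffix are hypothetical helpers; skipping comments is an assumption about what you want.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def replace_text_nodes(html, transform):
    soup = BeautifulSoup(html, "html.parser")
    # string=True matches every text node; find_all returns a list,
    # so it is safe to modify the tree while looping over the result.
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # assumption: leave HTML comments untouched
        node.replace_with(transform(node))
    return soup

# Stand-in for rand_text, used only for this demonstration
def add_suffix(text):
    return text + "_changed"

mixed = str(replace_text_nodes("<div>text1<strong>text2</strong>text3</div>", add_suffix))
nested = str(replace_text_nodes("<div><strong>text</strong></div>", add_suffix))
print(mixed)
print(nested)
```

Whitespace-only text nodes between tags are also matched, so you may want to skip them with a check such as node.strip() if your transform should not apply to them.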
Answered By - noah1400