Issue
How to decompose and smooth tags from a BeautifulSoup object?
I want to go from a soup to a soup, without converting to a string along the way.
The docs suggest using the smooth() method to eliminate undesired blank spaces. Can you show me how?
from bs4 import BeautifulSoup

dml = '''<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT
<div></div>
<p></p>
<div>MORE TEXT</div>
<b></b>
<i></i> # COMMENT
</body>
</html>'''

soup = BeautifulSoup(dml, features='lxml')

def strip_empty_tags(soup: BeautifulSoup):
    for item in soup.find_all():
        if not item.get_text(strip=True):
            item.decompose()
            soup.smooth()  # How to .smooth()?
    return soup

strip_empty_tags(soup)
<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT
<div>MORE TEXT</div>
# COMMENT
</body>
</html>
The decompose() and extract() methods leave undesired blank spaces and blank lines behind. I want to get rid of them, but I do not want to resort to ''.join([string for string in string_list]).
There are precedents to this question, in particular: [1], [2]. But all suggestions involve converting the BeautifulSoup object to a string. I can do that, I'm already doing that, but I don't want to do that.
This site has many other references to BeautifulSoup and "remove empty spaces", but most of them deal with situations where the text content has empty spaces to begin with. In my situation, the empty spaces are a by-product of BeautifulSoup's decompose/extract methods. I'd like to remove them immediately after they are created in the loop.
I am using the 'lxml' parser and don't plan to change unless absolutely necessary.
Solution
You can replace each empty tag with an empty string using tag.replace_with(''), then call parent.smooth() to merge the adjacent strings, and finally collapse the run of whitespace at the end of each string with re.sub.
For example:
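import re
from bs4 import BeautifulSoup

dml = '''<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT
<div></div>
<p></p>
<div>MORE TEXT</div>
<b></b>
<i></i> # COMMENT
</body>
</html>'''

soup = BeautifulSoup(dml, features='lxml')

def strip_empty_tags(soup: BeautifulSoup):
    for item in soup.find_all():
        # A tag counts as empty when it has no non-whitespace text.
        if not item.get_text(strip=True):
            p = item.parent
            # Swap the empty tag for an empty string, then merge the
            # neighbouring strings into a single node.
            item.replace_with('')
            p.smooth()
            # string=True is the current spelling of the older text=True.
            for c in p.find_all(string=True):
                # Collapse two or more trailing whitespace characters
                # into a single newline.
                c.replace_with(re.sub(r'\s{2,}$', '\n', c))
    return soup

print(strip_empty_tags(soup))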
Prints:
<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT
<div>MORE TEXT</div>
# COMMENT
</body>
</html>
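To see why smooth() on its own does not remove the blank lines, here is a minimal sketch on a smaller document. It uses 'html.parser' only to keep the example dependency-free (an assumption; the question uses 'lxml', and smooth() itself does not depend on the parser). The pattern r'\n\s*\n' is a hypothetical alternative to the one in the answer, not something the answer uses: it collapses blank lines anywhere in a string, so it would also drop blank lines that were in the text on purpose.

```python
import re
from bs4 import BeautifulSoup

# 'html.parser' is an assumption made to avoid the lxml dependency.
soup = BeautifulSoup("<body>A\n<div></div>\n<p></p>\nB</body>", "html.parser")
body = soup.body

# Remove the two empty tags; the strings around them stay behind
# as separate sibling nodes.
for tag in body.find_all(["div", "p"]):
    tag.decompose()
pieces_before = [str(s) for s in body.contents]
print(pieces_before)  # ['A\n', '\n', '\nB']

# smooth() merges adjacent strings but keeps every character,
# so the blank lines are still there, now inside one node.
body.smooth()
pieces_after = [str(s) for s in body.contents]
print(pieces_after)  # ['A\n\n\nB']

# Collapse the blank lines in place, without serialising the soup.
# (Hypothetical pattern, not the one from the answer above.)
node = body.contents[0]
node.replace_with(re.sub(r"\n\s*\n", "\n", node))
result = str(body.contents[0])
print(repr(result))  # 'A\nB'
```

So smooth() is what makes the whitespace reachable as a single string, and the re.sub step is what actually removes it.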
Answered By - Andrej Kesely