Thursday, November 10, 2022

[FIXED] How to decompose and smooth tags from a BeautifulSoup object?


Issue

How to decompose and smooth tags from a BeautifulSoup object?

Not from string.

From a soup, to a soup without going to a string.

The docs suggest using the smooth() method to eliminate undesired blank spaces. Can you show me?

from bs4 import BeautifulSoup
dml = '''<html>
<head>
    <title>TITLE</title>
</head>
<body>LOOSE TEXT
    <div></div>
    <p></p>
    <div>MORE TEXT</div>
    <b></b>
    <i></i> # COMMENT
</body>
</html>'''

soup = BeautifulSoup(dml, features='lxml')
def strip_empty_tags(soup:BeautifulSoup):
    for item in soup.find_all():
        if not item.get_text(strip=True):
            item.decompose()
            soup.smooth()  # How to .smooth()?
    return soup

strip_empty_tags(soup)

Result:

<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT


<div>MORE TEXT</div>

 # COMMENT
</body>
</html>

The decompose() and extract() methods leave unwanted blank spaces and blank lines behind. I want to get rid of them, but I do not want to use ''.join([string for string in string_list]).

There are precedents to this question, in particular: [1], [2]. But all suggestions involve converting the BeautifulSoup object to a string. I can do that, I'm already doing that, but I don't want to do that.

This site has many other references to BeautifulSoup and "remove empty spaces", but most of them deal with situations where the text content has empty spaces to begin with. In my situation, the empty spaces are a by-product of BeautifulSoup's decompose/extract methods. I'd like to remove them immediately after they are created in the loop.

I am using the 'lxml' parser and don't plan to change, unless absolutely necessary.

Solution

You can remove each empty tag with tag.replace_with(''), call parent.smooth() to merge the adjacent strings left behind, and then use re.sub to collapse any run of two or more trailing whitespace characters in each string to a single newline.

For example:

import re
from bs4 import BeautifulSoup

dml = '''<html>
<head>
    <title>TITLE</title>
</head>
<body>LOOSE TEXT
    <div></div>
    <p></p>
    <div>MORE TEXT</div>
    <b></b>
    <i></i> # COMMENT
</body>
</html>'''

soup = BeautifulSoup(dml, features='lxml')
def strip_empty_tags(soup:BeautifulSoup):
    for item in soup.find_all():
        if not item.get_text(strip=True):
            p = item.parent
            item.replace_with('')
            p.smooth()
            for c in p.find_all(string=True):  # string= is the current name for the older text= argument
                c.replace_with(re.sub(r'\s{2,}$', '\n', c))
    return soup


print( strip_empty_tags(soup) )

Prints:

<html>
<head>
<title>TITLE</title>
</head>
<body>LOOSE TEXT
<div>MORE TEXT</div>
 # COMMENT
</body>
</html>
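
To see what each step contributes, here is a minimal sketch on a smaller document. It is an illustration rather than part of the answer above, and it assumes the built-in html.parser (the question uses lxml) so that it runs without extra installs. The names before, after and trimmed are made up for the example. It shows that the newline strings sitting between the tags survive the removal, that smooth() only merges them into one string, and that the regex is what actually trims them.

```python
import re
from bs4 import BeautifulSoup

# Assumption: html.parser keeps the sketch dependency-free; the question uses lxml.
soup = BeautifulSoup("<body>A\n<b></b>\n<i></i>\n<div>B</div></body>", "html.parser")
body = soup.body

# Remove the two empty tags; the newline strings between them stay behind
# as separate, adjacent text nodes.
body.b.decompose()
body.i.decompose()
before = [str(node) for node in body.contents]

# smooth() merges adjacent strings into a single string; it strips nothing.
body.smooth()
after = [str(node) for node in body.contents]

# Collapse the run of trailing whitespace in the merged string to one newline.
first = body.contents[0]
first.replace_with(re.sub(r"\s{2,}$", "\n", first))
trimmed = [str(node) for node in body.contents]

print(before)
print(after)
print(trimmed)
```

Printing the three lists side by side makes the point: the blank lines are whitespace text nodes that were already in the parsed tree between the tags, so smooth() on its own cannot remove them.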

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
