Wednesday, April 6, 2022

[FIXED] Parsing and modifying content with Beautiful Soup (bs4)

April 06, 2022 beautifulsoup, html, python-3.x

Issue

The goal is to modify only the text content of existing HTML, leaving the markup itself intact.

For example, given the current markup:

<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces After a Period (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>

Suppose I want to append "™" to every word that is exactly six letters long.

The expected result:

<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces&#x2122; After a Period&#x2122; (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces&#x2122; After a Period&#x2122; (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>

I'm fairly new to Python and am having trouble with this. Because the content is nested, I'm struggling to access the elements properly and return the expected outcome.

This is what I have tried so far:

    soup = BeautifulSoup(markup, 'html.parser')
    new_html = []
    
    for tags in soup.contents:
        for tag in tags:
            if type(tag) != str:
                split_tag = re.split(r"(\W+)", str(tag.string))
                for word in split_tag:
                    if len(word) == 6 and  word.isalpha():
                        word += "&#x2122;"
                tag.string = "".join(split_tag)
            else:
                str_obj.append(tag)
            new_html.append(str(tag))
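
A likely reason this attempt leaves the text unchanged: `word += "&#x2122;"` rebinds the loop variable and never writes the result back into `split_tag`. A minimal sketch on a plain string, using only `re` (the sample sentence and the variable names are illustrative and not part of the original code):

```python
import re

text = "Two Spaces After a Period"
parts = re.split(r"(\W+)", text)

# Rebinding the loop variable does not change the list itself.
for word in parts:
    if len(word) == 6 and word.isalpha():
        word += "™"
unchanged = "".join(parts)

# Building a new list keeps the modified words.
marked = [w + "™" if len(w) == 6 and w.isalpha() else w for w in parts]
changed = "".join(marked)

print(unchanged)
print(changed)
```

Strings are immutable in Python, so a modified word has to be stored somewhere (here, a new list) before the pieces are joined again.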

Solution

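You can use .find_all(text=True) to collect every text node and .replace_with() to swap each one for its modified version (since Beautiful Soup 4.4.0 the argument can also be spelled string=True):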

import re
from bs4 import BeautifulSoup

html_doc = """
<html lang="en" op="item">
  <head>
    <meta name="referrer" content="origin">  
    <title>The Scientific Case for Two Spaces After a Period (2018)</title>
  </head>
  <body>
    <center>
        <table class="fatitem" border="0">
          <tr class='athing' id='25581282'>
            <td class="title">
              <a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
            </td>
          </tr>
        </table>
    </center>  
  </body> 
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")


for s in soup.find_all(text=True):
    # \b...\b with {6} matches only whole words of exactly six letters
    new_s = re.sub(r"\b([a-zA-Z]{6})\b", r"\1™", s)
    s.replace_with(new_s)

print(soup.prettify())

# to emit HTML entities instead of the literal character:
# print(soup.prettify(formatter="html"))

Prints:

<html lang="en" op="item">
 <head>
  <meta content="origin" name="referrer"/>
  <title>
   The Scientific Case for Two Spaces™ After a Period™ (2018)
  </title>
 </head>
 <body>
  <center>
   <table border="0" class="fatitem">
    <tr class="athing" id="25581282">
     <td class="title">
      <a class="titlelink">
       The Scientific Case for Two Spaces™ After a Period™ (2018)
      </a>
     </td>
    </tr>
   </table>
  </center>
 </body>
</html>
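
The quantifier in the pattern decides which words receive the suffix. As a small illustration on a plain string (the helper name `mark_words` is hypothetical and only `re` is used), `{6,}` matches runs of six or more letters, while `\b[a-zA-Z]{6}\b` matches only whole words of exactly six letters:

```python
import re

def mark_words(text, pattern):
    # Append the trademark sign after every match of `pattern`.
    return re.sub(pattern, lambda m: m.group(0) + "™", text)

title = "The Scientific Case for Two Spaces After a Period (2018)"

six_or_more = mark_words(title, r"[a-zA-Z]{6,}")
exactly_six = mark_words(title, r"\b[a-zA-Z]{6}\b")

print(six_or_more)
print(exactly_six)
```

With the first pattern "Scientific" is marked as well; with the second only "Spaces" and "Period" are, which is what the question asks for.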

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
