Issue
The goal is to modify only the text content of existing HTML, leaving the markup itself unchanged.
For example, given the current markup:
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces After a Period (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
Suppose I want to append the string "™" to each word whose length is exactly 6.
The expected result:
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces™ After a Period™ (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces™ After a Period™ (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
I'm fairly new to Python and am having trouble with this. Because the content is nested, I'm struggling to access the elements properly and return the expected outcome.
This is what I have tried so far:
soup = BeautifulSoup(markup, 'html.parser')
new_html = []
for tags in soup.contents:
    for tag in tags:
        if type(tag) != str:
            split_tag = re.split(r"(\W+)", str(tag.string))
            for word in split_tag:
                if len(word) == 6 and word.isalpha():
                    word += "™"
            tag.string = "".join(split_tag)
        else:
            str_obj.append(tag)
        new_html.append(str(tag))
Solution
You can use .find_all(text=True) in combination with .replace_with() (in Beautiful Soup 4.4.0 and later, the same argument is also available as string=True):
import re
from bs4 import BeautifulSoup
html_doc = """
<html lang="en" op="item">
<head>
<meta name="referrer" content="origin">
<title>The Scientific Case for Two Spaces After a Period (2018)</title>
</head>
<body>
<center>
<table class="fatitem" border="0">
<tr class='athing' id='25581282'>
<td class="title">
<a class="titlelink">The Scientific Case for Two Spaces After a Period (2018)</a>
</td>
</tr>
</table>
</center>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for s in soup.find_all(text=True):
    # \b...\b limits the match to words of exactly six letters;
    # {6,} would also match longer words such as "Scientific"
    new_s = re.sub(r"\b([a-zA-Z]{6})\b", r"\1™", s)
    s.replace_with(new_s)
print(soup.prettify())
# to have HTML entities:
# print(soup.prettify(formatter="html"))
Prints:
<html lang="en" op="item">
 <head>
  <meta content="origin" name="referrer"/>
  <title>
   The Scientific Case for Two Spaces™ After a Period™ (2018)
  </title>
 </head>
 <body>
  <center>
   <table border="0" class="fatitem">
    <tr class="athing" id="25581282">
     <td class="title">
      <a class="titlelink">
       The Scientific Case for Two Spaces™ After a Period™ (2018)
      </a>
     </td>
    </tr>
   </table>
  </center>
 </body>
</html>
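If you need this in more than one place, the same find_all / replace_with technique can be wrapped in a small function. The following is only a sketch under a few assumptions: the names mark_six_letter_words, SIX_LETTER_WORD and SKIPPED_PARENTS are made up for illustration, "word" is taken to mean a run of exactly six ASCII letters, and comments and script/style contents are assumed to be things you want left alone.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Exactly six ASCII letters, with a word boundary on both sides.
SIX_LETTER_WORD = re.compile(r"\b([a-zA-Z]{6})\b")

# Hypothetical choice: text inside these tags is code or styling, not prose.
SKIPPED_PARENTS = {"script", "style"}


def mark_six_letter_words(markup, suffix="™"):
    """Return markup with suffix appended to each six-letter word in text nodes."""
    soup = BeautifulSoup(markup, "html.parser")
    for node in soup.find_all(string=True):
        # Comments, doctypes and similar nodes are NavigableString subclasses;
        # replacing them with a plain string would turn them into visible text.
        if type(node) is not NavigableString:
            continue
        if node.parent.name in SKIPPED_PARENTS:
            continue
        new_text = SIX_LETTER_WORD.sub(lambda m: m.group(1) + suffix, str(node))
        if new_text != node:
            node.replace_with(new_text)
    # str(soup) serializes without re-indenting; prettify() would add line breaks.
    return str(soup)


if __name__ == "__main__":
    print(mark_six_letter_words("<p>Two Spaces After a Period (2018)</p>"))
```

Returning str(soup) instead of soup.prettify() keeps the whitespace of the input text nodes, which matters if the result is written back to a file. Attribute quoting and void tags are still normalized by the serializer.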
Answered By - Andrej Kesely