Issue
import os
from bs4 import BeautifulSoup
# Get a list of all .htm files in the HTML_bak folder
html_files = [file for file in os.listdir('HTML_bak') if file.endswith('.htm')]
# Loop through each HTML file
for file_name in html_files:
    input_file_path = os.path.join('HTML_bak', file_name)
    output_file_path = os.path.join('HTML', file_name)
    # Read the input file with errors='ignore'
    with open(input_file_path, 'r', encoding='utf-8', errors='ignore') as input_file:
        input_content = input_file.read()
    # Parse the input content using BeautifulSoup with html5lib parser
    soup = BeautifulSoup(input_content, 'html5lib')
    main_content = soup.find('div', style='position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;')
    # Overwrite the output file with modified content
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        output_file.write(str(main_content))
This code correctly scans the HTML files in a folder and pulls in only the desired div, selected by its style. However, this div sometimes contains tags that I want to remove. Those tags appear as:
<div class="gmail_quote">2010/2/11 some text here .... </div>
How can I edit my code to also remove these tags with the gmail_quote class?
Update 8/29/23:
I am including example HTML content to make sure my question is clear. I want to keep the contents of the <div style="position:initial.... that comes after <body bgColor=#ffffff> and remove the contents of the <div class="gmail_quote">2010/2/11 ...
<html><body style="background-color:#FFFFFF;"><div></div></body></html><article style="width:100%;float:left; position:left;background-color:#FFFFFF; margin: 0mm 0mm 0mm 0mm; "><style>
@media print {
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
}pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
@page {size: auto; margin: 12mm 4mm 12mm 6mm; }
</style>
<div style="position:initial;float:left;background-color:transparent;text-align:left;width:100%;margin-left:5px;">
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
.hdrfldname{color:black;font-size:20px; line-height:120%;}
.hdrfldtext{overflow-wrap:break-word;color:black;font-size:20px;line-height:120%;}
</style></head>
<body bgColor=#ffffff>
<div style="position:initial;float:left;text-align:left;font-weight:normal;width:100%;background-color:#eee9e9;">
<span class='hdrfldname'>SUBJECT: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>FROM: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>TO: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>DATE: </span><span class='hdrfldtext'>2010/02/12 09:10</span><br>
</div></body></html>
</div>
<div style="position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;"><br>
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap; white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
</style></head><body bgColor=#ffffff>
<div> lorem ipsum </div>
<div class="gmail_quote">2010/2/11 lorem ipsum<span dir="ltr"><<a style="max-width:100%;" href="lorem ipsum">lorem ipsum</a>></span><br>
</body></html>
</div>
</article>
<div> <br></div>
Solution
You can modify your code to remove the div tags with the gmail_quote class by using the decompose() method of the BeautifulSoup library. This method removes a tag from the tree and then completely destroys it and its contents.
# Find and remove all div tags with class gmail_quote
for tag in main_content.find_all('div', {'class': 'gmail_quote'}):
    tag.decompose()
Add this code inside the loop, directly below the line that assigns main_content. It removes all div tags with the gmail_quote class from main_content before it is written to the output file.
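For reference, here is a minimal, self-contained sketch of the same idea on an in-memory string, so it can be tried without the HTML_bak folder. The function name extract_without_quotes, the SAMPLE_HTML string, and its shortened style value are hypothetical and not from the question. The sketch also assumes the built-in html.parser instead of html5lib, and it adds a guard for the case where find() returns None because no div matches the style.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, shortened from the example in the question.
SAMPLE_HTML = """
<div style="width:98%;color:black;">
<div> lorem ipsum </div>
<div class="gmail_quote">2010/2/11 lorem ipsum<br></div>
</div>
"""


def extract_without_quotes(html, style):
    """Return the div whose style equals `style`, minus any gmail_quote divs."""
    # Assumption: html.parser is used here so the sketch needs no extra parser.
    soup = BeautifulSoup(html, "html.parser")
    main_content = soup.find("div", style=style)
    if main_content is None:
        # find() returns None when no div carries exactly this style string.
        return ""
    # Remove every quoted-reply div, including its children, from the tree.
    for tag in main_content.find_all("div", class_="gmail_quote"):
        tag.decompose()
    return str(main_content)


result = extract_without_quotes(SAMPLE_HTML, "width:98%;color:black;")
print(result)
```

Without the None check, a file that has no matching div would make main_content.find_all raise an AttributeError, so some guard of this kind is probably worth keeping in the folder loop as well.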
Answered By - Razzer