Tuesday, January 30, 2024

[FIXED] Multiple tags for BeautifulSoup

Labels: beautifulsoup, html, python

Issue

import os
from bs4 import BeautifulSoup

# Get a list of all .htm files in the HTML_bak folder
html_files = [file for file in os.listdir('HTML_bak') if file.endswith('.htm')]

# Loop through each HTML file
for file_name in html_files:
    input_file_path = os.path.join('HTML_bak', file_name)
    output_file_path = os.path.join('HTML', file_name)
    
    # Read the input file with errors='ignore'
    with open(input_file_path, 'r', encoding='utf-8', errors='ignore') as input_file:
        input_content = input_file.read()

    # Parse the input content using BeautifulSoup with html5lib parser
    soup = BeautifulSoup(input_content, 'html5lib')

    main_content = soup.find('div', style='position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;')
  
    # Overwrite the output file with modified content
    with open(output_file_path, 'w', encoding='utf-8') as output_file:
        output_file.write(str(main_content))

This code scans the HTML files in a folder and extracts only the desired div, matched by its style attribute. However, that div sometimes contains nested tags that I want to remove. They look like this:

<div class="gmail_quote">2010/2/11 some text here .... </div>

How can I edit my code so that it also removes the div tags with the gmail_quote class?

Update 8/29/23:

I am adding sample HTML content to make my question clear. I want to keep the contents of the <div style="position:initial.... that follows <body bgColor=#ffffff>, and remove the contents of the <div class="gmail_quote">2010/2/11 ...



<html><body style="background-color:#FFFFFF;"><div></div></body></html><article style="width:100%;float:left; position:left;background-color:#FFFFFF; margin: 0mm 0mm 0mm 0mm; "><style>
@media print {
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
}pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
@page {size: auto; margin: 12mm 4mm 12mm 6mm; }
</style>
<div style="position:initial;float:left;background-color:transparent;text-align:left;width:100%;margin-left:5px;">
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
.hdrfldname{color:black;font-size:20px; line-height:120%;}
.hdrfldtext{overflow-wrap:break-word;color:black;font-size:20px;line-height:120%;}
</style></head>
<body bgColor=#ffffff>
<div style="position:initial;float:left;text-align:left;font-weight:normal;width:100%;background-color:#eee9e9;">
<span class='hdrfldname'>SUBJECT: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>FROM: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>TO: </span><span class='hdrfldtext'>lorem ipsum</span><br>
<span class='hdrfldname'>DATE: </span><span class='hdrfldtext'>2010/02/12 09:10</span><br>
</div></body></html>
</div>
<div style="position:initial;float:left;text-align:left;overflow-wrap:break-word !important;width:98%;margin-left:5px;background-color:#FFFFFF;color:black;"><br>
<html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8;"><style>
pre { overflow-x:break-word; white-space:pre; white-space:hp-pre-wrap; white-space:-moz-pre-wrap; white-space:-o-pre-wrap;  white-space:-pre-wrap; white-space:pre-wrap; word-wrap:break-word;}
</style></head><body bgColor=#ffffff>
<div> lorem ipsum </div>

<div class="gmail_quote">2010/2/11 lorem ipsum<span dir="ltr">&lt;<a  style="max-width:100%;" href="lorem ipsum">lorem ipsum</a>&gt;</span><br>

</body></html>
</div>
</article>
<div>&nbsp;<br></div>

Solution

You can remove the div tags with the gmail_quote class by calling decompose() on each of them. This BeautifulSoup method removes a tag from the tree and then destroys it together with its contents.

    # Find and remove all div tags with class gmail_quote
    for tag in main_content.find_all('div', {'class': 'gmail_quote'}):
        tag.decompose()

Add this code directly below the line that assigns main_content, inside the for loop (the snippet is already indented to match).
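To show where the snippet fits, here is a minimal sketch of the per-file step wrapped in a function. The names clean_file and MAIN_STYLE and the parser parameter are hypothetical and only for illustration; the style string is copied from the question. The sketch also assumes you want to skip a file when the styled div is missing, since find() returns None in that case and str(None) would otherwise write the text "None" to the output file.

```python
import os
from bs4 import BeautifulSoup

# Style string copied from the question; it identifies the div to keep.
MAIN_STYLE = ('position:initial;float:left;text-align:left;'
              'overflow-wrap:break-word !important;width:98%;margin-left:5px;'
              'background-color:#FFFFFF;color:black;')


def clean_file(input_path, output_path, parser='html5lib'):
    """Hypothetical helper: keep the main div, drop gmail_quote divs.

    Returns True if the main div was found and written, otherwise False.
    """
    with open(input_path, 'r', encoding='utf-8', errors='ignore') as f:
        soup = BeautifulSoup(f.read(), parser)

    main_content = soup.find('div', style=MAIN_STYLE)
    if main_content is None:
        # Assumption: skip files without the styled div instead of writing "None".
        return False

    # The snippet from the answer, unchanged.
    for tag in main_content.find_all('div', {'class': 'gmail_quote'}):
        tag.decompose()

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(str(main_content))
    return True
```

Inside the original loop this could be called as clean_file(input_file_path, output_file_path).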

This should remove every div with the gmail_quote class, including anything nested inside it, from main_content before it is written to the output file.
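If you ever need to keep the removed quotes (for example, to log them), BeautifulSoup also offers extract(), which detaches a tag from the tree and returns it instead of destroying it. The following is only a small sketch on a made-up HTML string, using the built-in html.parser so it runs without html5lib; the id="main" attribute and variable names are assumptions for the demo, not part of the original files.

```python
from bs4 import BeautifulSoup

# Made-up, simplified HTML standing in for one of the real files.
soup = BeautifulSoup(
    '<div id="main"><p>kept</p><div class="gmail_quote">quoted</div></div>',
    'html.parser',
)
main = soup.find('div', id='main')

removed_text = []
for tag in main.find_all('div', class_='gmail_quote'):
    removed = tag.extract()  # detaches the tag from the tree and returns it
    removed_text.append(removed.get_text())

print(str(main))      # <div id="main"><p>kept</p></div>
print(removed_text)   # ['quoted']
```

When the removed content is not needed afterwards, decompose() as shown in the answer is the simpler choice.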

Answered By - Razzer

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
