Wednesday, January 12, 2022

[FIXED] How to read the source HTML code from a locally saved HTML file using Python?

Tags: beautifulsoup, html, python

Issue

I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks weird and here is a part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code looks something like this:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some thoughts?

Thank you!

Solution

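First of all, let's discuss why you are not getting the desired output. The leading ÿþ is what the UTF-16 little-endian byte order mark (bytes FF FE) looks like when it is decoded with a single-byte encoding such as cp1252, and the gaps between the letters are the zero bytes that UTF-16 stores next to every ASCII character. So the file was most likely saved as UTF-16 (Word writes this as charset=unicode), while open(file_path) decoded it with the platform's default encoding. Passing the matching encoding to open(), for example encoding='utf-16', should give BeautifulSoup readable markup. Note that soup.prettify() only re-indents the parsed tree; it does not change how the file is decoded. Once the file is decoded correctly, there are two ways to print the HTML:

Minimal solution: use soup.prettify()
Recommended solution: name the HTML parser explicitly and use soup.prettify()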
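
To see where the ÿþ prefix and the spaced-out letters in the question's output can come from, here is a minimal sketch using only the standard library. It assumes the saved file is UTF-16 little-endian with a byte order mark; the sample string `html_text` is a hypothetical stand-in for the real file's contents.

```python
import codecs

# Hypothetical stand-in for the start of the Word-exported file.
html_text = "<html><head>"

# UTF-16 little-endian bytes preceded by a byte order mark (FF FE).
# Assumption: this mirrors how the question's file is stored on disk.
raw = codecs.BOM_UTF16_LE + html_text.encode("utf-16-le")

# Decoding with a single-byte codec turns the BOM into "ÿþ" and
# leaves a NUL character after every ASCII letter.
garbled = raw.decode("latin-1")
print(repr(garbled[:8]))

# Decoding as UTF-16 consumes the BOM and recovers the original text.
recovered = raw.decode("utf-16")
print(recovered)
```

The NUL characters are typically rendered as blanks or not at all, which would explain why the letters in the question's output appear separated by spaces.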

Approach 1 (using `soup.prettify()` with your current code):

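# Import BeautifulSoup from the bs4 package
from bs4 import BeautifulSoup

# Path of the HTML file
file_path = 'demo.html'

# Parse the HTML file with BeautifulSoup.
# No parser is named here, so bs4 picks one itself and warns about it.
# If the file is UTF-16, as in the question, use open(file_path, encoding='utf-16').
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print the HTML with prettify() indentation
print(soup.prettify())

# Output: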
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Approach 2 (using an explicit HTML parser and `soup.prettify()`):

# Import BeautifulSoup (the html5lib package must also be installed)
from bs4 import BeautifulSoup

# Open the HTML file and parse it with the html5lib parser.
# If the file is UTF-16, as in the question, add encoding='utf-16' to open().
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")

# Print the parsed HTML with prettify() indentation
print(soup.prettify())

# Output:
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>
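
If the encoding of the saved file is not known in advance, one option is to read the raw bytes and pick the codec from a leading byte order mark before parsing. The following is a rough sketch, not part of the original answer: `read_html_text`, its `fallback_encoding` parameter, and the demo file name are hypothetical names chosen for illustration, and files without a BOM are simply assumed to be UTF-8.

```python
import codecs
import os
import tempfile


def read_html_text(path, fallback_encoding="utf-8"):
    """Return the file's text, choosing the codec from a leading byte order mark."""
    with open(path, "rb") as fp:
        raw = fp.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" drops the UTF-8 BOM.
        return raw.decode("utf-8-sig")
    # No BOM found: fall back to the assumed default.
    return raw.decode(fallback_encoding)


# Demo: write a small UTF-16 file (Python adds the BOM) and read it back.
sample = "<html><head><title>demo</title></head></html>"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_utf16.html")
    with open(demo_path, "w", encoding="utf-16") as fp:
        fp.write(sample)
    text = read_html_text(demo_path)

print(text)
```

The returned string can then be passed to BeautifulSoup in place of the file object, for example `BeautifulSoup(text, "html.parser")`, and printed with `soup.prettify()` as in the approaches above.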

Hope this solution helps.

Answered By - Jay Patel

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
