Issue
I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:
with open(file_path) as fp:
    soup = BeautifulSoup(fp)
print(soup)
The output looks weird and here is a part of it:
<html><body><p>ÿþh t m l >
h e a d >
m e t a h t t p - e q u i v = C o n t e n t - T y p e c o n t e n t = " t e x t / h t m l ; c h a r s e t = u n i c o d e " >
m e t a n a m e = G e n e r a t o r c o n t e n t = " M i c r o s o f t W o r d 1 5 ( f i l t e r e d ) " >
s t y l e >
! - -
/ * F o n t D e f i n i t i o n s * /
The original HTML code is something like
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
Can anyone help me or share some thoughts?
Thank you!
Solution
First of all, let's discuss why you are not getting the desired output. When BeautifulSoup parses the file, the text it receives contains stray spaces and symbols: the leading ÿþ in your output is a UTF-16 byte-order mark, and the gaps between the letters suggest the file was decoded with the wrong encoding. The solutions for this scenario are stated below:
- Needed solution: use soup.prettify()
- Appropriate solution: use an HTML parser and soup.prettify() together
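To see where the ÿþ and the spaced-out letters in the question's output can come from, here is a minimal sketch using only the standard library. It assumes the file was saved as UTF-16 with a byte-order mark, which is what Word's charset=unicode label usually means; the sample HTML string is made up for illustration.

```python
# Encode a small HTML snippet the way a "charset=unicode" export is assumed
# to be stored: UTF-16 little-endian, preceded by a byte-order mark (BOM).
html = "<html><head></head></html>"
raw = "\ufeff".encode("utf-16-le") + html.encode("utf-16-le")

# Decoding those bytes with a single-byte codec reproduces the symptom:
# the BOM turns into the two characters "ÿþ", and a NUL character
# follows every letter.
wrong = raw.decode("latin-1")
print(repr(wrong[:12]))

# Decoding with the matching codec recovers the original text;
# the "utf-16" codec reads and drops the BOM.
right = raw.decode("utf-16")
print(right)
```

If your file behaves like `raw` here, the fix is on the reading side (choosing the right encoding) rather than on the printing side.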
Approach 1 (using soup.prettify() in your current code):
# Import BeautifulSoup
from bs4 import BeautifulSoup
# File path of the HTML file
file_path = 'demo.html'
# Parse the HTML file with BeautifulSoup
with open(file_path) as fp:
    soup = BeautifulSoup(fp)
# Print the HTML in prettified form
print(soup.prettify())
# Output of the above cell:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Microsoft Word 15 (filtered)" name="Generator"/>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
</style>
</head>
</html>
Approach 2 (using an HTML parser and soup.prettify()):
# Import the required libraries (html5lib must be installed)
from bs4 import BeautifulSoup
import html5lib
# Open the HTML file and parse it with the html5lib parser
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")
# Print the parsed HTML
print(soup.prettify())
# Output of the above cell:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Microsoft Word 15 (filtered)" name="Generator"/>
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
</style>
</head>
</html>
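If the output still starts with ÿþ after either approach, the file is probably UTF-16, and it may help to handle the encoding explicitly. The sketch below is not part of the approaches above; it assumes a UTF-16 file with a byte-order mark and writes a small made-up sample to a temporary demo.html so that it runs on its own. It uses Python's built-in "html.parser" so no extra parser needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Create a stand-in for the Word export: a small UTF-16 file with a BOM.
# (Hypothetical sample; in practice file_path would point at your own file.)
sample = (
    '<html><head><meta name=Generator content="Microsoft Word 15 (filtered)">'
    "</head><body><p>Hello</p></body></html>"
)
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "demo.html")
with open(file_path, "w", encoding="utf-16") as fp:
    fp.write(sample)

# Option 1: tell open() the encoding, so BeautifulSoup receives decoded text.
with open(file_path, encoding="utf-16") as fp:
    soup_text = BeautifulSoup(fp, "html.parser")

# Option 2: open in binary mode and let BeautifulSoup work out the
# encoding itself from the byte-order mark.
with open(file_path, "rb") as fp:
    soup_bytes = BeautifulSoup(fp, "html.parser")

print(soup_text.prettify())
# The encoding BeautifulSoup detected for the binary input
print(soup_bytes.original_encoding)
```

Option 1 is the more explicit choice when you already know the encoding; Option 2 is convenient when files from different sources may use different encodings.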
Hope this solution helps you.
Answered By - Jay Patel