Issue
I use BeautifulSoup 3.2.1 to parse a large number of HTML files translated with eTranslation.
I found that
soup = BeautifulSoup(html_file, "html.parser")
sometimes cuts off a section of my HTML file. This seems to be related to invalid tags or other problems in the HTML.
I also found that
soup = BeautifulSoup(html_file, "lxml")
works better in these cases of badly written HTML.
Is there a way to detect which HTML files are invalid using BeautifulSoup?
I imagine something like this:
if valid(html_file):
    soup = BeautifulSoup(html_file, "html.parser")
else:
    soup = BeautifulSoup(html_file, "lxml")
Solution
I solved it by using the lxml parser for every file, so no validity check is needed.
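If a check like the valid() imagined in the question is still wanted, a rough sketch is possible with the standard library alone. As far as I know BeautifulSoup does not offer a validity check, so everything named here (VOID_TAGS, TagBalanceChecker, looks_well_formed) is a hypothetical helper and not part of BeautifulSoup or lxml. It only checks that tags are closed in the order they were opened. It is deliberately strict: HTML allows some end tags (such as </p> or </li>) to be omitted, so valid documents can be flagged too, which in this setup just sends them to lxml.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags and records mismatched end tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags opened but not yet closed
        self.problems = []    # end tags that did not match the top of the stack

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def looks_well_formed(html):
    """Return True only if every non-void tag is closed in nesting order."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return not checker.problems and not checker.open_tags
```

With this, the parser name could be chosen as "html.parser" when looks_well_formed(html_file) is true and "lxml" otherwise, assuming html_file holds the document as a string. Using lxml for everything is still the simpler route.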
Answered By - GhitaB