Issue
If I have an HTML document like the following and parse it with Beautiful Soup, how can I access the lines before the <head> element?
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
For instance, the standard way to access the head element is soup.head, and the body is soup.body. I assume that's because head and body are both standard tags.
Is there a way to access the elements before <head>?
Solution
You can do this by selecting the head tag and walking backwards over its previous elements:
import bs4
from bs4 import BeautifulSoup
from w3lib.html import remove_tags

html = '<?xml version="1.0" encoding="utf-8" standalone="no"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>'
soup = BeautifulSoup(html, "html.parser")

prev_head = ""
# Walk backwards from <head>, prepending each node to keep document order
for element in soup.head.previous_elements:
    if isinstance(element, BeautifulSoup):
        continue  # skip the root soup object if the walk reaches it
    if isinstance(element, bs4.element.Tag):
        # str() of a tag includes its children, so <html> brings <head> along
        prev_head = str(element) + prev_head
    else:
        # str() of a doctype or processing instruction omits its delimiters
        prev_head = element.PREFIX + str(element) + element.SUFFIX + prev_head

# Drop the <head> tags that came along inside <html>
prev_head = remove_tags(prev_head, which_ones=("head",))
BeautifulSoup(prev_head, "html.parser")
After this process, prev_head holds all the markup above <head> as a string. You can then call BeautifulSoup(prev_head) to get a BS object for later use.
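If you only need to inspect the nodes before <head> rather than rebuild the markup, a smaller variant may be enough. The following is a minimal sketch, not part of the answer above: it assumes the html.parser builder, and nodes_before_head is a hypothetical helper name introduced here for illustration. It returns the parsed objects themselves, in document order, without needing w3lib.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<?xml version="1.0" encoding="utf-8" standalone="no"?>'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
)
soup = BeautifulSoup(html, "html.parser")


def nodes_before_head(soup):
    """Return the nodes that precede <head>, in document order (hypothetical helper)."""
    nodes = [
        el for el in soup.head.previous_elements
        if not isinstance(el, BeautifulSoup)  # skip the root object if the walk reaches it
    ]
    nodes.reverse()  # previous_elements walks backwards from <head>
    return nodes


before_head = nodes_before_head(soup)
for node in before_head:
    if isinstance(node, Tag):
        # Tags expose their name and attributes directly
        print("tag:", node.name, node.attrs)
    else:
        # Doctype and processing-instruction nodes behave like strings
        print(type(node).__name__ + ":", str(node))
```

With the sample document, the list should hold the XML declaration, the doctype, and the <html> tag, so something like before_head[-1]["xmlns"] reads the namespace attribute.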
PS:
Notice that I've removed the <head> tag: <html> is the first previous_element, and its string form includes <head>. I've also wrapped the non-tag elements in their prefix and suffix, because their plain str form omits them, and without them they cannot be parsed back into a BS object.
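To make the prefix and suffix point concrete, here is a small sketch using a shorter doctype than the one in the question (an assumption made only to keep the example compact). It compares the plain str form of a Doctype node with the form that has PREFIX and SUFFIX restored, then parses the restored form again.

```python
from bs4 import BeautifulSoup
from bs4.element import Doctype

soup = BeautifulSoup("<!DOCTYPE html><html><head></head></html>", "html.parser")
doctype = next(node for node in soup.contents if isinstance(node, Doctype))

bare = str(doctype)                                    # content only, no delimiters
full = doctype.PREFIX + str(doctype) + doctype.SUFFIX  # delimiters restored

print(repr(bare))
print(repr(full))

# Only the restored form parses back into a Doctype node
reparsed = BeautifulSoup(full, "html.parser")
print(type(reparsed.contents[0]).__name__)
```

Parsing the bare string instead would give ordinary text, which is why the loop above adds the delimiters back before concatenating.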
Answered By - A.Lorefice