Issue
Other questions with similar titles did not answer my question.
If I execute this:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "lxml")
soup.find_all(["p", "li"])
I get this result:
[<p>111</p>, <p>before</p>, <li>222</li>]
I expected to find "after" in the result as well, either as part of the second "p" element or as a 4th item in the list.
Is this expected behaviour? Is there a way to retrieve the text "after"?
More weirdness: if I run print(soup.prettify()), this is the result:
<html>
 <body>
  <p>
   111
  </p>
  <p>
   before
  </p>
  <ul>
   <li>
    222
   </li>
  </ul>
  after
 </body>
</html>
The "ul" and "after" are no longer part of the second "p". I assume the source is not valid HTML (?), but again:
Is there a way to deal with this, other than just dropping "after"?
Solution
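I suggest using a different parser than lxml in this case: html.parser. lxml is stricter than html.parser: it closes the open "p" as soon as it meets the "ul" (in HTML a "p" element cannot contain a "ul"), which leaves "after" outside the paragraph, whereas html.parser keeps the nesting exactly as written: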
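from bs4 import BeautifulSoup
# html.parser does not re-nest the markup, so <ul> and "after" stay inside the second <p>
soup = BeautifulSoup("<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>", "html.parser")
print(soup.find_all(["p", "li"]))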
Prints:
[<p>111</p>, <p>before<ul><li>222</li></ul>after</p>, <li>222</li>]
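If what you actually need is the text rather than the tags, here is a minimal sketch of how "after" could be pulled out once html.parser has kept it inside the second "p". The helper name direct_text is my own, not part of Beautiful Soup; it just filters the children of a tag for plain text nodes.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<html><body><p>111</p><p>before<ul><li>222</li></ul>after</p></body></html>"
soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Hypothetical helper: text nodes that are direct children of tag."""
    return [str(child) for child in tag.children if isinstance(child, NavigableString)]


second_p = soup.find_all("p")[1]

# Only the text owned by the <p> itself, skipping the nested <ul>.
own_text = direct_text(second_p)

# All text under the <p>, nested <li> included, joined with spaces.
all_text = second_p.get_text(" ", strip=True)

print(own_text)
print(all_text)
```

If you have to stay with lxml, the prettified output in the question suggests that "after" ends up as a sibling right after the "ul", so something along the lines of soup.find("ul").next_sibling should reach it, but I would treat that as a workaround that depends on how lxml repairs this particular markup.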
Answered By - Andrej Kesely