Issue
I have a piece of code that extracts the text of the p element inside div.statement:
a = """<div class="theorem" id="theorem-AA" acro="AA" titletext="Adjoint of an Adjoint"> <h5 class="theorem"> <span class="type">Theorem </span><span class="acro">AA</span><span class="titletext"> Adjoint of an Adjoint</span> </h5> <div class="statement"><p>Suppose that $A$ is a matrix. Then $\adjoint{\left(\adjoint{A}\right)}=A$.</p></div> <div class="proof"><a knowl="./knowls/proof.AA.knowl">Proof</a></div> </div><div class="context"><a href="http://linear.pugetsound.edu/html/section-MO.html#theorem-AA" class="context" title="Section MO">(in context)</a></div> """
from bs4 import BeautifulSoup as bs
soup = bs(repr(a),features = 'lxml')
statement = bs(repr(soup.find_all("div", {"class": "statement"})[0])).find('p').text
print(statement)
The output was:
Suppose that $A$ is a matrix. Then $\x07djoint{\\left(\x07djoint{A}\right)}=A$.
I need the output to be:
Suppose that $A$ is a matrix. Then $\adjoint{\left(\adjoint{A}\right)}=A$.
How can I do this?
Solution
The problem is with your string literal: \a is the escape sequence for the bell character (\x07), so Python puts non-printable characters into the string before BeautifulSoup ever parses it.
So, either escape each \ as \\, or just prefix the string with r to make it a raw string, and then process the escape sequences like this:
from bs4 import BeautifulSoup as bs
a = r"""<div class="theorem" id="theorem-AA" acro="AA" titletext="Adjoint of an Adjoint"> <h5 class="theorem"> <span class="type">Theorem </span><span class="acro">AA</span><span class="titletext"> Adjoint of an Adjoint</span> </h5> <div class="statement"><p>Suppose that $A$ is a matrix. Then $\adjoint{\left(\adjoint{A}\right)}=A$.</p></div> <div class="proof"><a knowl="./knowls/proof.AA.knowl">Proof</a></div> </div><div class="context"><a href="http://linear.pugetsound.edu/html/section-MO.html#theorem-AA" class="context" title="Section MO">(in context)</a></div> """
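# repr(a) doubles every backslash, so the parsed text contains \\adjoint etc.
soup = bs(repr(a), "html.parser")
statement = soup.find_all("div", {"class": "statement"})[0].find('p').getText()
# unicode_escape collapses each doubled backslash back into a single one
print(bytes(statement, "utf-8").decode("unicode_escape"))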
Output:
Suppose that $A$ is a matrix. Then $\adjoint{\left(\adjoint{A}\right)}=A$.
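To see why the original literal went wrong, here is a minimal standard-library sketch comparing a regular literal, a raw literal, and a manually escaped one. The variable names (plain, raw, escaped) are illustrative only and do not come from the question.

```python
# Regular literal: Python interprets "\a" as the bell character (0x07)
# when it compiles the string, so the backslash is already gone.
plain = "\adjoint{A}"

# Raw literal: the backslash is kept as an ordinary character.
raw = r"\adjoint{A}"

# Doubling the backslash by hand gives the same string as the raw literal.
escaped = "\\adjoint{A}"

print(repr(plain))            # '\x07djoint{A}'
print(repr(raw))              # '\\adjoint{A}'
print(raw == escaped)         # True
print(len(plain), len(raw))   # 10 11
```

Since the raw string already holds single literal backslashes, passing a directly to BeautifulSoup instead of repr(a) should make the unicode_escape step unnecessary. Note also that decoding with unicode_escape treats the bytes as Latin-1, so it would likely mangle any non-ASCII characters in the text.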
Answered By - baduker