Issue
How can I extract data from the following example HTML with BeautifulSoup?
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
I tried both .findall and .get_text, but I am not able to extract the text values from the htmlText element.
Expected output:
some thing ORget exact data from here
Solution
Here are the steps you need to take:
# first, select all "htmlText" elements
soup.select("htmlText")

# second, iterate over all of them
for result in soup.select("htmlText"):
    ...  # further code

# third, parse the element's text with another BeautifulSoup() object;
# the CDATA payload is plain text to the first BeautifulSoup() object,
# so the <p> and <lite> elements inside it are not reachable as tags
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml")

# fourth, take the first <p> element and its text -> "p.text"
for result in soup.select("htmlText"):
    final = BeautifulSoup(result.text, "lxml").p.text
Full code (use whichever variant you find most readable):
from bs4 import BeautifulSoup
import lxml  # not used directly; the "lxml" parser only needs to be installed
html = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""
soup = BeautifulSoup(html, "lxml")
# BeautifulSoup inside BeautifulSoup, all on one line
unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
print(unreadable_soup)
# same idea, reusing the soup object created above
example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
print(example_1)
# loop over every "htmlText" match instead of taking only the first one
for result in soup.select("htmlText"):
    example_2 = BeautifulSoup(result.text, "lxml").p.text
    print(example_2)
# or one liner
example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
print(example_3)
# output
'''
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
some thing ORget exact data from here
'''
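
As a cross-check that does not depend on BeautifulSoup, the same two-pass idea (read the CDATA payload as text, then parse that text again) can be sketched with the standard library's xml.etree.ElementTree. This swaps in a different parser from the one used above, and it assumes that both the outer document and the CDATA payload are well-formed XML with a single root element, which holds for the sample but may not hold for real-world HTML. The helper name `cdata_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

xml_doc = """
<Tag1>
<message code="able to extract text from here"/>
<text value="able to extract text that is here"/>
<htmlText><![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]></htmlText>
</Tag1>
"""


def cdata_text(document, tag="htmlText"):
    """Hypothetical helper: visible text of each <tag> element's CDATA payload."""
    # First pass: an XML parser hands back the CDATA payload as plain text.
    root = ET.fromstring(document.strip())
    results = []
    for element in root.iter(tag):
        inner_markup = (element.text or "").strip()
        if not inner_markup:
            continue  # nothing to parse in an empty element
        # Second pass: parse the payload itself.
        # Assumption: the payload is well-formed XML with one root element.
        inner_root = ET.fromstring(inner_markup)
        # itertext() yields the text of the root and all nested elements in order.
        results.append("".join(inner_root.itertext()))
    return results


print(cdata_text(xml_doc))
```

If the payload is not well-formed, ET.fromstring raises xml.etree.ElementTree.ParseError, in which case a lenient HTML parser such as the nested BeautifulSoup call above is the better fit.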
Answered By - Dimitry Zub