Issue
It turns out that the tag name should be: "ix:nonfraction"
This does not work. No "ix" tag is found.
from bs4 import BeautifulSoup
text = """
<td style="BORDER-BOTTOM:0.75pt solid #7f7f7f;white-space:nowrap;vertical-align:bottom;text-align:right;">$ <ix:nonfraction name="ecd:AveragePrice" contextref="P01_01_2022To12_31_2022" unitref="Unit_USD" decimals="2" scale="0" format="ixt:num-dot-decimal">97.88</ix:nonfraction>
</td>
"""
soup = BeautifulSoup(text, 'lxml')
print(soup)
ix_tags = soup.find_all('ix')
print(ix_tags)
But the following works, and I don't see the difference. Why is that? Thanks a lot!
html_content = """
<html>
<body>
<ix>Tag 1</ix>
<ix>Tag 2</ix>
<ix>Tag 3</ix>
<p>Not an ix tag</p>
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'lxml')
ix_tags = soup.find_all('ix')
for tag in ix_tags:
    print(tag.text)
Solution
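The issue here arises from how BeautifulSoup handles prefixed tags like <ix:nonfraction>. The lxml parser treats the input as HTML, which has no concept of XML namespaces, so the whole string ix:nonfraction is kept as the tag name. find_all('ix') looks for tags named exactly ix, so it finds nothing in your first snippet. In your second snippet the tags really are named ix, which is why that one works.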
In the markup you provided, ix is the namespace prefix and nonfraction is the local name of the element. In XML, namespaces avoid name conflicts by qualifying element and attribute names within a document. The ix:nonfraction tag indicates that the nonfraction element belongs to the namespace bound to the ix prefix.
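To see which names an HTML parser actually assigns, you can list every tag name in the parsed tree. The sketch below is only an illustration: it uses a shortened copy of your snippet and Python's built-in html.parser (an assumption made so it runs without lxml installed). As the sample run further down shows, the lxml parser keeps the same full ix:nonfraction name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question (most attributes trimmed for brevity).
snippet = '<td>$ <ix:nonfraction name="ecd:AveragePrice">97.88</ix:nonfraction></td>'

# html.parser is used here only to avoid the lxml dependency (assumption).
soup = BeautifulSoup(snippet, "html.parser")

# find_all(True) matches every tag, so this lists the names as the parser sees them.
names = [tag.name for tag in soup.find_all(True)]
print(names)

# No tag is named exactly "ix", so this search comes back empty.
print(soup.find_all("ix"))
```

The prefix and the local name show up as a single tag name, which is why a search for 'ix' alone matches nothing.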
To find prefixed tags like <ix:nonfraction> when using the lxml parser, use the exact tag name in your find_all call:
ix_tags = soup.find_all('ix:nonfraction')
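If the goal is closer to your original find_all('ix'), that is, every tag carrying the ix: prefix regardless of its local name, one option is to pass a function to find_all. This is a sketch under assumptions: the second tag and its attribute value are made up for illustration, has_ix_prefix is a hypothetical helper name, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first tag comes from the question, the second is invented.
snippet = """
<td>$ <ix:nonfraction name="ecd:AveragePrice">97.88</ix:nonfraction>
<ix:nonnumeric name="ecd:ExampleLabel">n/a</ix:nonnumeric></td>
"""

soup = BeautifulSoup(snippet, "html.parser")


def has_ix_prefix(tag):
    """Return True for tags whose full name starts with the 'ix:' prefix."""
    return tag.name.startswith("ix:")


# find_all accepts a function and calls it once per tag in the tree.
ix_tags = soup.find_all(has_ix_prefix)
ix_names = [tag.name for tag in ix_tags]
print(ix_names)
```

This matches on the literal prefix string only. It does not check which namespace URI the prefix is bound to, which is usually enough for this kind of scraping but is not real namespace handling.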
If you want to find the tags without writing the prefix, you can use the xml parser instead. For this snippet it drops the ix prefix and exposes the element under its local name, nonfraction:
soup = BeautifulSoup(text, 'xml')
ix_tags = soup.find_all('nonfraction')
Sample run:
from bs4 import BeautifulSoup
text = """
<td style="BORDER-BOTTOM:0.75pt solid #7f7f7f;white-space:nowrap;vertical-align:bottom;text-align:right;">$ <ix:nonfraction name="ecd:AveragePrice" contextref="P01_01_2022To12_31_2022" unitref="Unit_USD" decimals="2" scale="0" format="ixt:num-dot-decimal">97.88</ix:nonfraction>
</td>
"""
soup = BeautifulSoup(text, 'lxml')
ix_tags = soup.find_all('ix:nonfraction')
print(ix_tags)
soup = BeautifulSoup(text, 'xml')
ix_tags = soup.find_all('nonfraction')
print(ix_tags)
Output:
[<ix:nonfraction contextref="P01_01_2022To12_31_2022" decimals="2" format="ixt:num-dot-decimal" name="ecd:AveragePrice" scale="0" unitref="Unit_USD">97.88</ix:nonfraction>]
[<nonfraction contextref="P01_01_2022To12_31_2022" decimals="2" format="ixt:num-dot-decimal" name="ecd:AveragePrice" scale="0" unitref="Unit_USD">97.88</nonfraction>]
Answered By - Bilesh Ganguly