Sunday, January 14, 2024

[FIXED] Why can't BeautifulSoup find this supposedly XBRL-related "ix" tag?

Labels: beautifulsoup, python

Issue

It turns out that the tag name should be: "ix:nonfraction"

The following does not work: no "ix" tag is found.

from bs4 import BeautifulSoup

text = """
<td style="BORDER-BOTTOM:0.75pt solid #7f7f7f;white-space:nowrap;vertical-align:bottom;text-align:right;">$ <ix:nonfraction name="ecd:AveragePrice" contextref="P01_01_2022To12_31_2022" unitref="Unit_USD" decimals="2" scale="0" format="ixt:num-dot-decimal">97.88</ix:nonfraction>
</td>
"""

soup = BeautifulSoup(text, 'lxml')
print(soup)
ix_tags = soup.find_all('ix')
print(ix_tags)

But the following works, and I don't see the difference. Why is that? Thanks a lot!

html_content = """
<html>
  <body>
    <ix>Tag 1</ix>
    <ix>Tag 2</ix>
    <ix>Tag 3</ix>
    <p>Not an ix tag</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_content, 'lxml')
ix_tags = soup.find_all('ix')
for tag in ix_tags:
    print(tag.text)

Solution

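The issue arises from how BeautifulSoup handles prefixed tag names like <ix:nonfraction>. HTML has no concept of namespaces, so with the lxml (HTML) parser the whole string ix:nonfraction becomes the tag name. find_all('ix') only matches tags whose name is exactly ix, which is why the second snippet works and the first returns an empty list.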

In the XML you provided, ix is the namespace prefix and nonfraction is the local name of the element. XML namespaces avoid name conflicts by qualifying element and attribute names. The ix:nonfraction tag indicates that the nonfraction element belongs to the namespace bound to the ix prefix.
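As a minimal sketch of this behavior, the snippet below parses a shortened copy of the markup from the question with the lxml HTML parser and lists the tag names it produces. The shortened markup and the variable names are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened copy of the cell from the question (style attribute dropped).
text = '<td>$ <ix:nonfraction name="ecd:AveragePrice" decimals="2">97.88</ix:nonfraction></td>'

soup = BeautifulSoup(text, 'lxml')

# find_all(True) returns every tag; collect the names the parser assigned.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)

# 'ix' alone is not a tag name here, so this search comes back empty.
no_match = soup.find_all('ix')

# The full, prefixed string is the tag name, so this search succeeds.
exact_match = soup.find_all('ix:nonfraction')

print(len(no_match), len(exact_match))
```

The printed names should include ix:nonfraction as a single name, with no separate ix tag anywhere in the tree.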

To find namespaced tags like <ix:nonfraction> when using the lxml parser, pass the full tag name, prefix included, to find_all:

ix_tags = soup.find_all('ix:nonfraction')
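If several different ix: elements need to be collected in one pass, one option is to give find_all a compiled regular expression instead of a string. The sketch below assumes the lxml HTML parser; the second cell (ix:nonnumeric with an ex:Label name) is made up purely for illustration.

```python
import re

from bs4 import BeautifulSoup

# Two cells: the first mirrors the question, the second is a hypothetical
# example added only to show that more than one ix: element is matched.
text = """
<td><ix:nonfraction name="ecd:AveragePrice">97.88</ix:nonfraction></td>
<td><ix:nonnumeric name="ex:Label">Example</ix:nonnumeric></td>
"""

soup = BeautifulSoup(text, 'lxml')

# A compiled pattern is matched against each tag name, so '^ix:' selects
# every tag whose name starts with the ix: prefix.
ix_tags = soup.find_all(re.compile(r'^ix:'))

for tag in ix_tags:
    print(tag.name, tag.get_text())
```

This keeps the search independent of any particular local name, at the cost of also matching ix: elements you may not care about.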

If you want to find the tags by local name, without the prefix, you can use the xml parser instead (it also requires lxml to be installed). It is namespace-aware and handles prefixed tags more gracefully:

soup = BeautifulSoup(text, 'xml')
ix_tags = soup.find_all('nonfraction')

Sample run:

from bs4 import BeautifulSoup

text = """
<td style="BORDER-BOTTOM:0.75pt solid #7f7f7f;white-space:nowrap;vertical-align:bottom;text-align:right;">$ <ix:nonfraction name="ecd:AveragePrice" contextref="P01_01_2022To12_31_2022" unitref="Unit_USD" decimals="2" scale="0" format="ixt:num-dot-decimal">97.88</ix:nonfraction>
</td>
"""

soup = BeautifulSoup(text, 'lxml')
ix_tags = soup.find_all('ix:nonfraction')
print(ix_tags)


soup = BeautifulSoup(text, 'xml')
ix_tags = soup.find_all('nonfraction')
print(ix_tags)

Output:

[<ix:nonfraction contextref="P01_01_2022To12_31_2022" decimals="2" format="ixt:num-dot-decimal" name="ecd:AveragePrice" scale="0" unitref="Unit_USD">97.88</ix:nonfraction>]
[<nonfraction contextref="P01_01_2022To12_31_2022" decimals="2" format="ixt:num-dot-decimal" name="ecd:AveragePrice" scale="0" unitref="Unit_USD">97.88</nonfraction>]
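Once the tags are found, their attributes can be read like dictionary entries. The following is a rough sketch, assuming the lxml HTML parser and that the tag text is a plain decimal number; the facts list and its keys are hypothetical names chosen for this example, and a real filing would also need the scale and format attributes taken into account.

```python
from bs4 import BeautifulSoup

text = """
<td>$ <ix:nonfraction name="ecd:AveragePrice" contextref="P01_01_2022To12_31_2022" unitref="Unit_USD" decimals="2" scale="0" format="ixt:num-dot-decimal">97.88</ix:nonfraction>
</td>
"""

soup = BeautifulSoup(text, 'lxml')

facts = []
for tag in soup.find_all('ix:nonfraction'):
    facts.append({
        'name': tag.get('name'),           # concept name, e.g. ecd:AveragePrice
        'context': tag.get('contextref'),  # reporting period reference
        'unit': tag.get('unitref'),        # unit reference
        # Assumes plain text such as "97.88" with no thousands separators.
        'value': float(tag.get_text(strip=True)),
    })

print(facts)
```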

Answered By - Bilesh Ganguly

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
