Wednesday, January 12, 2022

[FIXED] python beautiful soup html tag question [updated]


Issue

I have the following lines in Markdown (.md) files:

<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>

[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings

<td class="IDtd">Some_text EEE-123411 Other text</td>

My questions are:

How can I check, using Beautiful Soup, whether the next line after a <td> is an HTML tag or text?
How can I add an HTML comment around all links (HTML and Markdown), followed by an ID?

The expected output for the second question is:

<td colspan="1" class="IDtd">
<p>
<!-- 
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> 
--> #ID - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<!--
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a> 
--> #ID
</td>

<!--
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
--> #ID

For the first question, I found this:

from bs4 import BeautifulSoup

html = """
<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://_jira_link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<a href="https://_jira_link/jira/browse/EEEE-2543" class="external-link" rel="nofollow">https://_jira_link/browse/EEEEE-2543</a>
</td>

 """
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
tds = soup.find_all("td", {"class": "IDtd"})
for td in tds:
    p = td.find_all("p")  # find_all returns a list
    if p:
        a = td.find_all("a")  # search inside this <td>, not the whole document
        if a:
            print("Anchor text is: " + a[0].get_text())
            continue
        print("P text is: " + p[0].get_text())
        continue
    else:
        print("No P and A tags found")

Thank you in advance

Solution

Your first question, how to find out what follows a certain tag, can be answered with the next_element attribute, something like this:

from bs4 import BeautifulSoup, Comment

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>

[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings

<td class="IDtd">Some_text EEE-123411 Other text</td>"""

soup = BeautifulSoup(html, "html.parser")
element = soup.td

for _ in range(5):
    element = element.next_element
    print(type(element), element.name)

This shows you the type and name of the next five elements that follow the <td> tag:

<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> p
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> a
<class 'bs4.element.NavigableString'> None

As you can see, the next element is actually a string (which contains the newline), followed by the <p> tag.
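Because the element right after <td> is usually just whitespace, a "tag or text" check has to skip blank strings first. The following is a minimal sketch, assuming the goal is to classify the first non-blank child of each <td>; the helper name first_meaningful_child is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">Some_text EEE-123411 Other text</td>"""


def first_meaningful_child(tag):
    """Return the first child that is not a whitespace-only string.

    Hypothetical helper for illustration; returns None for an empty tag.
    """
    for child in tag.children:
        if isinstance(child, NavigableString) and not child.strip():
            continue  # skip the newline between <td> and its first real child
        return child
    return None


soup = BeautifulSoup(html, "html.parser")

kinds = []
for td in soup.find_all("td", class_="IDtd"):
    child = first_meaningful_child(td)
    if isinstance(child, Tag):
        kinds.append(("tag", child.name))
    elif isinstance(child, NavigableString):
        kinds.append(("text", str(child).strip()))
    else:
        kinds.append(("empty", None))

print(kinds)
```

With this input, the first <td> should be reported as starting with a <p> tag and the second as starting with plain text.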

For your second question, BeautifulSoup lets you insert or extract nodes as needed. First iterate over all of the required <a> tags, then create a Comment object whose content is the serialised <a> tag and insert it before the tag. Finally, remove the original <a> tag:

from bs4 import BeautifulSoup, Comment

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>

[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings

<td class="IDtd">Some_text EEE-123411 Other text</td>"""

soup = BeautifulSoup(html, "html.parser")

for td in soup.find_all('td', class_="IDtd"):
    for a_tag in td.find_all('a'):
        a_tag.insert_before(Comment(f'\n{a_tag}\n'))
        a_tag.extract()

print(soup)

The updated HTML would be:

<td class="IDtd" colspan="1">
<p>
<!--
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a>
--> - <span>number of total submissions</span>
</p>
</td>

<td class="IDtd">
<!--
<a class="external-link" href="https://link/browse/EEEE-2543" rel="nofollow">https://link/browse/EEEEE-2543</a>
-->
</td>

[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings

<td class="IDtd">Some_text EEE-123411 Other text</td>
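The output above still lacks two parts of the expected output in the question: the #ID marker after each comment, and the Markdown link, which html.parser sees only as plain text. The following is a rough sketch of one way to cover both. It assumes that #ID is a literal placeholder, that a Markdown link always starts its line, and that the Markdown text contains no characters the serialiser would escape (such as &). The names MARKER and md_link_line are made up for this example:

```python
import re

from bs4 import BeautifulSoup, Comment, NavigableString

MARKER = " #ID"  # literal placeholder copied from the question's expected output

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>

[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
"""

soup = BeautifulSoup(html, "html.parser")

# HTML links: insert the comment, then the marker, then drop the original tag.
for td in soup.find_all("td", class_="IDtd"):
    for a_tag in td.find_all("a"):
        a_tag.insert_before(Comment(f"\n{a_tag}\n"))
        a_tag.insert_before(NavigableString(MARKER))
        a_tag.extract()

# Markdown links: the parser keeps them as plain text, so wrap every line
# that starts with [text](url) after serialising the document.
md_link_line = re.compile(r"^(\[[^\]]+\]\([^)]+\).*)$", re.MULTILINE)
result = md_link_line.sub(
    lambda m: f"<!--\n{m.group(1)}\n-->{MARKER}",
    str(soup),
)

print(result)
```

Running the regular expression over the serialised string rather than over individual text nodes keeps the sketch short, at the cost of relying on the line-start assumption stated above.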

Answered By - Martin Evans

This answer was collected from Stack Overflow and tested by PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
