Issue
I have the following lines in Markdown (.md) files:
<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>
My questions are:
- How can I check, using Beautiful Soup, whether what follows a <td> tag is an HTML tag or text?
- How can I wrap every link (HTML and Markdown) in an HTML comment, followed by an ID?
The expected output for the second question is:
<td colspan="1" class="IDtd">
<p>
<!--
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a>
--> #ID - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<!--
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
--> #ID
</td>
<!--
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
--> #ID
For the first question, I found this:
from bs4 import BeautifulSoup
html = """
<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://_jira_link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://_jira_link/jira/browse/EEEE-2543" class="external-link" rel="nofollow">https://_jira_link/browse/EEEEE-2543</a>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
tds = soup.find_all("td", {"class": "IDtd"})
for td in tds:
    p = td.find_all("p")  # find_all returns a list
    a = td.find_all("a")  # search inside this td, not the whole document
    if a:
        print("Anchor text is: " + a[0].get_text())
    elif p:
        print("P text is: " + p[0].get_text())
    else:
        print("No P and A tags found")
Thank you in advance
Solution
For your first question, you can find out what follows a given tag by using the next_element attribute, something like this:
from bs4 import BeautifulSoup, Comment
html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""
soup = BeautifulSoup(html, "html.parser")
element = soup.td
for _ in range(5):
    element = element.next_element
    print(type(element), element.name)
This shows the type and name of the next five elements that follow the first <td> tag:
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> p
<class 'bs4.element.NavigableString'> None
<class 'bs4.element.Tag'> a
<class 'bs4.element.NavigableString'> None
As you can see, the next element is actually a string (containing the newline), followed by the <p> tag.
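If the whitespace strings get in the way, one option is to skip them and look at the first meaningful child of each <td>. The following is only a sketch under that assumption; the helper names first_meaningful_child and describe_cells are made up for illustration and are not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">Some_text EEE-123411 Other text</td>"""


def first_meaningful_child(td):
    """Return the first child of td that is not a whitespace-only string."""
    for child in td.children:
        if isinstance(child, NavigableString) and not child.strip():
            continue  # skip the newline/indentation between tags
        return child
    return None


def describe_cells(markup):
    """Classify what each IDtd cell starts with: a tag, text, or nothing."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for td in soup.find_all("td", class_="IDtd"):
        child = first_meaningful_child(td)
        if isinstance(child, Tag):
            results.append(("tag", child.name))
        elif child is not None:
            results.append(("text", str(child).strip()))
        else:
            results.append(("empty", None))
    return results


print(describe_cells(html))
```

For the two cells in this sample, the first one is reported as starting with a p tag and the second as plain text.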
For your second question, you can insert or extract tags as needed using BeautifulSoup. First iterate over all of the required <a> tags, then create a Comment object whose content is the <a> tag. Insert this before the tag, then remove the original <a> tag:
from bs4 import BeautifulSoup, Comment
html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""
soup = BeautifulSoup(html, "html.parser")
for td in soup.find_all('td', class_="IDtd"):
    for a_tag in td.find_all('a'):
        a_tag.insert_before(Comment(f'\n{a_tag}\n'))
        a_tag.extract()
print(soup)
The updated HTML would be:
<td class="IDtd" colspan="1">
<p>
<!--
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a>
--> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<!--
<a class="external-link" href="https://link/browse/EEEE-2543" rel="nofollow">https://link/browse/EEEEE-2543</a>
-->
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>
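The output above does not yet add the #ID marker, and the Markdown link is left alone because the parser sees it as plain text. A possible extension is sketched below. It assumes the marker is the literal text " #ID" from the question and that a Markdown link always starts at the beginning of its line; the names ID_MARKER, MD_LINK_LINE and comment_out_links are hypothetical:

```python
import re
from bs4 import BeautifulSoup, Comment, NavigableString

ID_MARKER = " #ID"  # assumption: literal placeholder from the expected output

# Assumption: a Markdown inline link starts its line: [text](url) plus trailing text
MD_LINK_LINE = re.compile(r"^\[[^\]]+\]\([^)]+\).*$", re.MULTILINE)

html = """<td colspan="1" class="IDtd">
<p>
<a class="external-link" href="https://link/browse/DDDD-3194" rel="nofollow">DDDD-3194</a> - <span>number of total submissions</span>
</p>
</td>
<td class="IDtd">
<a href="https://link/browse/EEEE-2543" class="external-link" rel="nofollow">https://link/browse/EEEEE-2543</a>
</td>
[AAAA-4444](https://link/browse/AAAA-4444) - BO NANO : UAT Findings
<td class="IDtd">Some_text EEE-123411 Other text</td>"""


def comment_out_links(markup):
    soup = BeautifulSoup(markup, "html.parser")
    for td in soup.find_all("td", class_="IDtd"):
        for a_tag in td.find_all("a"):
            # Swap the <a> tag for a comment holding its markup,
            # then put the marker text right after the comment.
            comment = Comment(f"\n{a_tag}\n")
            a_tag.replace_with(comment)
            comment.insert_after(NavigableString(ID_MARKER))
    # Markdown links are not tags, so wrap them on the serialized text instead.
    return MD_LINK_LINE.sub(
        lambda m: f"<!--\n{m.group(0)}\n-->{ID_MARKER}", str(soup)
    )


result = comment_out_links(html)
print(result)
```

The regular expression is deliberately narrow: a Markdown link in the middle of a line, or inside a table cell, would need a different pattern.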
Answered By - Martin Evans