Issue
I have an HTML document with several elements of the same kind, and I need to remove everything after the last element of that kind.
html = '''
<p>Some text element.</p>
<p>Some other text element.</p>
<p class="myclass">This is an element with class</p>
<p>This is an element without class.</p>
<p>Other paragraph.</p>
<p class="myclass">The second element with class.</p>
<p>Another paragraph.</p>
<p>More</p>
<p>...</p>
'''
I can select the last element with the class, but I don't know how to select everything that comes after it. I found no information about removing content by matching a variable with a regex.
from bs4 import BeautifulSoup
import lxml
soup = BeautifulSoup(html, 'lxml')
# Select all <p> elements with the class
ps_with_class = soup.find_all('p', {'class': 'myclass'})
# If any such elements exist
if ps_with_class:
    # Select the last one
    last_p_with_class = ps_with_class[-1]
    # How to remove something like r"last_p_with_class*" from html? maybe using /import re/
If I can remove everything after the second element with class "myclass", the output should then be:
<p>Some text element.</p>
<p>Some other text element.</p>
<p class="myclass">This is an element with class</p>
<p>This is an element without class.</p>
<p>Other paragraph.</p>
<p class="myclass">The second element with class.</p>
Solution
You could use a combination of .next_sibling and extract() to remove all elements following the second matching <p>.
For example:
from bs4 import BeautifulSoup
import lxml
html = '''
<p>Some text element.</p>
<p>Some other text element.</p>
<p class="myclass">This is an element with class</p>
<p>This is an element without class.</p>
<p>Other paragraph.</p>
<p class="myclass">The second element with class.</p>
<p>Another paragraph.</p>
<p>More</p>
<p>...</p>
'''
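soup = BeautifulSoup(html, 'lxml')
# Index 1 is the second match; use [-1] to always pick the last one
second = soup.find_all('p', {'class': 'myclass'})[1]
sibling = second.next_sibling

# Save the next sibling before detaching the current one from the tree
while sibling:
    next_sibling = sibling.next_sibling
    sibling.extract()
    sibling = next_sibling

print(soup)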
This would produce an updated HTML as:
<html><body><p>Some text element.</p>
<p>Some other text element.</p>
<p class="myclass">This is an element with class</p>
<p>This is an element without class.</p>
<p>Other paragraph.</p>
<p class="myclass">The second element with class.</p></body></html>
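The same idea can be wrapped in a small helper that always targets the last match and does nothing when no match exists, which is closer to what the question asks for. This is only a sketch: the function name truncate_after_last is made up for illustration, and it assumes the built-in 'html.parser' instead of 'lxml', so the result is not wrapped in <html><body>. It iterates over a copy of .next_siblings so that detaching nodes does not disturb the iteration.

```python
from bs4 import BeautifulSoup

html = '''
<p>Some text element.</p>
<p>Some other text element.</p>
<p class="myclass">This is an element with class</p>
<p>This is an element without class.</p>
<p>Other paragraph.</p>
<p class="myclass">The second element with class.</p>
<p>Another paragraph.</p>
<p>More</p>
<p>...</p>
'''


def truncate_after_last(markup, name, class_name):
    """Hypothetical helper: drop every sibling after the last matching tag."""
    soup = BeautifulSoup(markup, 'html.parser')
    matches = soup.find_all(name, class_=class_name)
    if matches:
        last = matches[-1]
        # Copy the generator into a list first, because extract()
        # changes the tree that .next_siblings is walking.
        for sibling in list(last.next_siblings):
            sibling.extract()
    return soup


result = truncate_after_last(html, 'p', 'myclass')
print(result)
```

Note that this only removes nodes that share a parent with the last match. If the matching element were nested inside another tag, content following that parent would stay in place.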
Answered By - Martin Evans