Issue
Here's a sample of what I'm scraping:
<p><strong>Title 1</strong>
<br />
lorem ipsum 1</p>
<p>lorem ipsum 2</p>
…
<p>lorem ipsum n</p>
<p><strong>Title 2</strong>
<br />
blah blah </p>
I would like all text (no tags) starting after <strong>Title 1</strong> up to, and not including, <strong>Title 2</strong>.
The expected result is: "lorem ipsum 1 lorem ipsum 2 lorem ipsum n"
Here is what I tried:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find the <strong> tag with the specified text in the section argument
strong_tag = soup.find('strong', string="Title 1")
print("TAG", strong_tag)
if strong_tag:
    # Retrieve all text following the <strong> tag until the next <strong> tag
    section_text = ''
    next_sibling = strong_tag.next_sibling
    print("NEXT SIBLING", next_sibling)
    while next_sibling:
        if next_sibling.string and next_sibling.name != 'strong':
            section_text += next_sibling.string.strip() + ' '
            print("SECTION TEXT", section_text)
            next_sibling = next_sibling.next_sibling
        else:
            break
    if not section_text:
        next_tag = strong_tag.find_next()
        print("FIND_NEXT", next_tag)
        while next_tag and next_tag.name != 'strong':
            if next_tag.string:
                print("FIND_NEXT.STRING", next_tag.string)
                section_text += next_tag.string.strip() + ' '
            next_tag = next_tag.find_next()
    return section_text.strip()
else:
    print(f"Section '{section}' not found.")
    return None
This returns "lorem ipsum 2 lorem ipsum n" but not "lorem ipsum 1".
So I tried this:
strong_tag = soup.find('strong', string="Title 1")
if strong_tag:
    # Retrieve all text until the next <strong> tag, regardless of its position
    section_text = ''
    print("TAG", strong_tag)
    while strong_tag:
        if strong_tag.string:
            # Append text
            section_text += strong_tag.string.strip() + ' '
        next_item = strong_tag.next_sibling
        print("NEXTITEM", next_item)
        while next_item and not hasattr(next_item, 'name') and not isinstance(next_item, str):
            # Append text nodes not wrapped in tags
            section_text += next_item.string.strip() + ' '
            next_item = next_item.next_sibling
        if not next_item:
            # Stop if there is no next sibling
            break
        if next_item.name == 'strong':
            # Stop if next tag is a <strong> tag
            break
        strong_tag = next_item
    return section_text.strip()
else:
    print(f"Section '{section}' not found.")
    return None
This returns only "lorem ipsum 1".
How do I modify the code so that it retrieves all text from one element to the next, sequentially, whether or not the text is wrapped in a tag, and regardless of whether the nodes are siblings, parents, or children?
Solution
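One possible solution is to use .find_previous() on each NavigableString: iterate over every text node, look up the nearest preceding <strong> tag, and keep the node only if that tag contains "Title 1". The first collected item is the title's own text, so it is sliced off with text[1:]: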
from bs4 import BeautifulSoup
html_doc = '''\
<p><strong>Title 1</strong>
<br />
lorem ipsum 1</p>
<p>lorem ipsum 2</p>
<p>lorem ipsum n</p>
<p><strong>Title 2</strong>
<br />
blah blah </p>'''
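soup = BeautifulSoup(html_doc, 'html.parser')
text = []
# Visit every text node in document order
for t in soup.find_all(string=True):
    # Nearest <strong> tag that appears before this text node
    prev = t.find_previous('strong')
    # Keep non-blank text whose preceding <strong> is "Title 1"
    if prev and 'Title 1' in prev.text and t.strip():
        text.append(t.strip())
# text[0] is "Title 1" itself, so skip it
print(text[1:])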
Prints:
['lorem ipsum 1', 'lorem ipsum 2', 'lorem ipsum n']
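If a single string is wanted, as in the question, one alternative is to walk the document forward from the starting <strong> tag with .next_elements and stop at the next <strong>. This is only a sketch and not the approach from the answer above: the function name section_text and the constant HTML_DOC are made up for illustration, and it assumes the title is the only content of its <strong> tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup from the question (HTML_DOC is a hypothetical name)
HTML_DOC = '''\
<p><strong>Title 1</strong>
<br />
lorem ipsum 1</p>
<p>lorem ipsum 2</p>
<p>lorem ipsum n</p>
<p><strong>Title 2</strong>
<br />
blah blah </p>'''


def section_text(html, title):
    """Return the text between <strong>title</strong> and the next <strong>.

    Returns None when no <strong> tag with exactly that text exists.
    """
    soup = BeautifulSoup(html, 'html.parser')
    start = soup.find('strong', string=title)
    if start is None:
        return None

    parts = []
    # .next_elements yields every following node in document order,
    # regardless of whether it is a child, sibling, or in another parent
    for el in start.next_elements:
        if isinstance(el, Tag):
            if el.name == 'strong':
                break  # reached the next section title
            continue  # other tags carry no text of their own
        if el.parent is start:
            continue  # skip the title's own text
        stripped = el.strip()
        if stripped:
            parts.append(stripped)
    return ' '.join(parts)


print(section_text(HTML_DOC, 'Title 1'))
```

Because the loop stops at the first following <strong>, it does not scan the rest of the document, and no slicing is needed to drop the title.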
Answered By - Andrej Kesely