Friday, January 5, 2024

[FIXED] How to use bs4 to grab all text, sequentially, whether wrapped in element tag or not, regardless of hierarchical order

Labels: beautifulsoup, python, python-3.x

Issue

Here's a sample of what I'm scraping:

<p><strong>Title 1</strong>
<br />
lorem ipsum 1</p>
<p>lorem ipsum 2</p>
…
<p>lorem ipsum n</p>

<p><strong>Title 2</strong>
<br />
blah blah </p>

I would like all text (no tags) starting after <strong>Title 1</strong> up to, and not including <strong>Title 2</strong>.

The expected result is: "lorem ipsum 1 lorem ipsum 2 lorem ipsum n"

Here is what I tried:

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the <strong> tag with the specified text in the section argument
    strong_tag = soup.find('strong', string="Title 1")
    print("TAG", strong_tag)
    if strong_tag:
        # Retrieve all text following the <strong> tag until the next <strong> tag
        section_text = ''
        next_sibling = strong_tag.next_sibling
        print("NEXT SIBLING", next_sibling)
        while next_sibling:
            if next_sibling.string and next_sibling.name != 'strong':
                section_text += next_sibling.string.strip() + ' '
                print("SECTION TEXT", section_text)
                next_sibling = next_sibling.next_sibling
            else:
                break
        
        if not section_text:
            next_tag = strong_tag.find_next()
            print("FIND_NEXT", next_tag)
            while next_tag and next_tag.name != 'strong':
                if next_tag.string:
                    print("FIND_NEXT.STRING", next_tag.string)
                    section_text += next_tag.string.strip() + ' '
                next_tag = next_tag.find_next()
        
        return section_text.strip()
    else:
        print(f"Section '{section}' not found.")
        return None

This returns "lorem ipsum 2 lorem ipsum n" but not "lorem ipsum 1".

So I tried this:

    strong_tag = soup.find('strong', string="Title 1")

    if strong_tag:
        # Retrieve all text until the next <strong> tag, regardless of its position
        section_text = ''
        print("TAG", strong_tag)
        while strong_tag:
            if strong_tag.string:
                # Append text
                section_text += strong_tag.string.strip() + ' '
            next_item = strong_tag.next_sibling
            print("NEXTITEM", next_item)
            while next_item and not hasattr(next_item, 'name') and not isinstance(next_item, str):
                # Append text nodes not wrapped in tags
                section_text += next_item.string.strip() + ' '
                next_item = next_item.next_sibling
            if not next_item:
                # Stop if there is no next sibling
                break
            if next_item.name == 'strong':
                # Stop if next tag is a <strong> tag
                break
            strong_tag = next_item

        return section_text.strip()
    else:
        print(f"Section '{section}' not found.")
        return None

This returns only "lorem ipsum 1".

How do I modify the code so that it retrieves all text from one element to the next, sequentially, whether or not the text is wrapped in a tag, and regardless of whether the nodes are siblings, parents, or children?

Solution

One possible solution is to iterate over every text node and call .find_previous() on each NavigableString to check which <strong> heading most recently precedes it:

    from bs4 import BeautifulSoup


    html_doc = '''\
    <p><strong>Title 1</strong>
    <br />
    lorem ipsum 1</p>
    <p>lorem ipsum 2</p>
    <p>lorem ipsum n</p>
    <p><strong>Title 2</strong>
    <br />
    blah blah </p>'''

    soup = BeautifulSoup(html_doc, 'html.parser')

    text = []
    # Visit every text node in document order
    for t in soup.find_all(string=True):
        # Nearest <strong> tag that opens before this text node
        prev = t.find_previous('strong')
        # Keep non-blank text whose nearest preceding <strong> is "Title 1"
        if prev and 'Title 1' in prev.text and t.strip():
            text.append(t.strip())

    # text[0] is the heading text "Title 1" itself, so skip it
    print(text[1:])

Prints:

    ['lorem ipsum 1', 'lorem ipsum 2', 'lorem ipsum n']
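
The question asks for a single joined string rather than a list. As an alternative sketch (not part of the original answer), the same section can be collected by walking .next_elements forward from the starting <strong> tag and stopping at the next <strong>. The helper name section_text and its parameters are hypothetical, chosen here for illustration, and the sketch assumes the heading text matches the <strong> content exactly.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html_doc = '''\
<p><strong>Title 1</strong>
<br />
lorem ipsum 1</p>
<p>lorem ipsum 2</p>
<p>lorem ipsum n</p>
<p><strong>Title 2</strong>
<br />
blah blah </p>'''


def section_text(html, title):
    """Hypothetical helper: text between <strong>title</strong> and the next <strong>."""
    soup = BeautifulSoup(html, 'html.parser')
    start = soup.find('strong', string=title)
    if start is None:
        return None

    parts = []
    # next_elements yields every following node in document order,
    # crossing sibling, parent, and child boundaries
    for node in start.next_elements:
        if isinstance(node, Tag) and node.name == 'strong':
            break  # reached the next section heading
        # Collect text nodes, skipping the heading's own text
        if isinstance(node, NavigableString) and node.parent is not start:
            stripped = node.strip()
            if stripped:
                parts.append(stripped)
    return ' '.join(parts)


print(section_text(html_doc, 'Title 1'))
```

Because the loop stops at the first following <strong> tag, this variant does not scan the rest of the document, and it returns None when the heading is not found.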

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
