Thursday, April 21, 2022

[FIXED] Get tag name and text with Soup

Labels: beautifulsoup, python, web-crawler

Issue

I am trying to get the text from a header with BeautifulSoup. The header's attributes are dynamic, which is why I have stripped them from the HTML shown below. I want to get each piece of text together with the name of the tag that contains it. I tried this:

    for element in header.find_all(string = True):
        if len(element.text.strip('\n')) > 0:
            print(element)
            a = header.find_all(text = element.text)
            print(a + " " + a.text.strip('\n'))

But the output contains only the text, not the tag name together with its text. How can I solve that?

Thank you in advance.

First Example:

<header>

    <div>
        <h2>
            <span>
                Text A here</span>
            <span><span>Text B Here <span>
                    </span>
        </h2>
        <div>
            <div>
                <span>
                    <div>
                    </div>
                </span>
            </div>
            <div>
                Text C Here
                <a>
                    Text C Here</a>
            </div>
        </div>
        <div>

            <a>
                Text D</a>
        </div>
        <div>
            Text E
        </div>
    </div>
    </div>
</header>

Second Example:

<header>
    <div>
        <div>
            <h2>
                <span>
                    Text A
                </span>
                <span><span>Text B</span>
                </span>
            </h2>
            <div>
                Text C
            </div>
        </div>
    </div>
</header>

Third Example:

<header>
    <div>
        <div>
            <h2>
                <span>
                    Text A
                </span>
                <span>
                    <span>
                        <span><svg>
                                <g>
                                    <rect></rect>
                                    <path></path>
                                    <path></path>
                                </g>
                            </svg>
                        </span>
                    </span><span>Text B</span>
                </span>
            </h2>
            <div>
                Text C
            </div>
            <div>
                Text D
                <a>
                    Text E</a><span>Text F </span>
                <a>
                    Text G</a><span> Text H </span>
                <a>
                    Text I</a>
            </div>
        </div>
    </div>
</header>

Solution

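Each item returned by find_all(string=True) is a NavigableString, and its .parent is the tag that directly contains it. Use element.parent.name to get the tag name.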
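As a minimal sketch of that idea, here it is applied to the second example from the question, which is not covered in the examples that follow. The helper name tag_text_pairs is made up for illustration and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Second example from the question, unchanged.
html = '''<header>
    <div>
        <div>
            <h2>
                <span>
                    Text A
                </span>
                <span><span>Text B</span>
                </span>
            </h2>
            <div>
                Text C
            </div>
        </div>
    </div>
</header>'''


def tag_text_pairs(root):
    """Return (tag name, stripped text) for every non-blank string under root.

    Hypothetical helper, named here only for illustration.
    """
    pairs = []
    for element in root.find_all(string=True):
        text = element.strip()
        if text:  # skip whitespace-only strings between tags
            pairs.append((element.parent.name, text))
    return pairs


header = BeautifulSoup(html, 'html.parser')
for name, text in tag_text_pairs(header):
    print(name, text)
# span Text A
# span Text B
# div Text C
```

Returning pairs instead of printing directly keeps the tag name and the text together, so they can be filtered or stored later.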

Based on your example 1:

    from bs4 import BeautifulSoup

    html='''<header>
    
        <div>
            <h2>
                <span>
                    Text A here</span>
                <span><span>Text B Here <span>
                        </span>
            </h2>
            <div>
                <div>
                    <span>
                        <div>
                        </div>
                    </span>
                </div>
                <div>
                    Text C Here
                    <a>
                        Text C Here</a>
                </div>
            </div>
            <div>
    
                <a>
                    Text D</a>
            </div>
            <div>
                Text E
            </div>
        </div>
        </div>
    </header>'''
    
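    header = BeautifulSoup(html.strip(), 'html.parser')
    for element in header.find_all(string=True):
        text = element.strip()
        # Skip whitespace-only strings between tags; strip('\n') alone
        # would leave the indentation spaces and let them through.
        if text:
            print(text)
            # The string's parent is the tag that directly contains it.
            print(element.parent.name + " " + text)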

Output:

Text A here
span Text A here
Text B Here
span Text B Here
Text C Here
div Text C Here
Text C Here
a Text C Here
Text D
a Text D
Text E
div Text E
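
One caveat when looking a string up again by its text, for example with header.find(string=...): find returns only the first match, so if exactly the same text occurs under two different tags, both occurrences are reported with the first tag. Reading .parent from the string you are iterating over avoids this. The sketch below uses a made-up snippet (not from the question) and a hypothetical helper, compare_lookups, to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up snippet in which the same text occurs in two different tags.
soup = BeautifulSoup("<div>Same<a>Same</a></div>", "html.parser")


def compare_lookups(root):
    """For each string, pair the direct parent's name with the parent's
    name found by searching for the string's text again."""
    rows = []
    for element in root.find_all(string=True):
        by_search = root.find(string=str(element))  # first string with this text
        rows.append((element.parent.name, by_search.parent.name))
    return rows


for direct, searched in compare_lookups(soup):
    print(direct, searched)
# div div
# a div
```

The second row shows the mismatch: the string really sits inside the a tag, but the search by text lands on the earlier string inside the div.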

Example 3:

from bs4 import BeautifulSoup

html='''<header>
    <div>
        <div>
            <h2>
                <span>
                    Text A
                </span>
                <span>
                    <span>
                        <span><svg>
                                <g>
                                    <rect></rect>
                                    <path></path>
                                    <path></path>
                                </g>
                            </svg>
                        </span>
                    </span><span>Text B</span>
                </span>
            </h2>
            <div>
                Text C
            </div>
            <div>
                Text D
                <a>
                    Text E</a><span>Text F </span>
                <a>
                    Text G</a><span> Text H </span>
                <a>
                    Text I</a>
            </div>
        </div>
    </div>
</header>'''

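header = BeautifulSoup(html.strip(), 'html.parser')
for element in header.find_all(string=True):
    text = element.strip()
    # Skip whitespace-only strings between tags; strip('\n') alone
    # would leave the indentation spaces and let them through.
    if text:
        print(text)
        # The string's parent is the tag that directly contains it.
        print(element.parent.name + " " + text)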

Output:

Text A
span Text A
Text B
span Text B
Text C
div Text C
Text D
div Text D
Text E
a Text E
Text F
span Text F
Text G
a Text G
Text H
span Text H
Text I
a Text I

Answered By - KunduK

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
