Saturday, September 17, 2022

[FIXED] Grab data outside a specific tag in Python BeautifulSoup4


Issue

I am using Python 3.8 with BeautifulSoup4. I am on Windows 10 and I use PyCharm.

I am fairly new to this library, but I have managed simple extractions so far. However, I have this HTML (which I did not write and cannot edit):

<ul>
            <li>
               <span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
               <ul>
                  <li>
                     <ul>
                        <li>
                           <a class="tdme" href="orgues/achenhei.htm">&gt;
                                                St-Georges : Max ROETHINGER, 1962.</a>
                        </li>
                     </ul>
                  </li>
               </ul>
            </li>
            <li>
               <span class="def">Adamswiller</span> (Région de Drulingen, Bas-Rhin)<ul>
                  <li>
                     <ul>
                        <li>
                           <a class="tdme" href="orgues/adamswpr.htm">&gt;
                                                Eglise protestante : George WEGMANN, 1846.</a>
                        </li>
                     </ul>
                  </li>
               </ul>
            </li>              
</ul>

So far, I was able to grab the values "Achenheim" (in the span tag), and "St-Georges : Max ROETHINGER, 1962." (in the a tag).

I would like to know if it's possible to grab the following values:

(Région de Mundolsheim, Bas-Rhin)

I am struggling because this text is not inside any specific tag other than the enclosing li tag. When I print the li tag, I get the whole element, nested lists included:

<li>
<span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/achenhei.htm">&gt;
                                                St-Georges : Max ROETHINGER, 1962.</a>
</li>
</ul>
</li>
</ul>
</li>

My code is this:

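from bs4 import BeautifulSoup
import requests

# 'url_link' is a placeholder for the real page URL
response = requests.get('url_link')
soup = BeautifulSoup(response.content, 'html.parser')

# Selects every <li> nested under a <ul>, at any depth
regions = soup.select('ul li')
for region in regions:
    # str() renders the whole tag, including all nested markup
    print(str(region))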

It's not what I was expecting :( Does anyone know how to grab data that sits outside a specific tag? Again, I am trying to get:

(Région de Mundolsheim, Bas-Rhin)

Trimming and slicing are not an option in my case :/

Solution

You could change your selection strategy and start from the <span> instead. The text you want is the span's next sibling, a bare text node inside the <li>:

for e in soup.select('ul span'):
    print(e.text)
    print(e.next_sibling.strip())
    print(' '.join(e.find_next('a').text.split()))
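As a variation on the snippet above, the same text can also be reached from the <li> side by reading only the text nodes that are direct children of the <li> and ignoring everything nested deeper. This is a minimal sketch, not part of the original answer: the helper name direct_text and the shortened sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample modelled on the markup in the question.
html = """
<ul>
  <li>
    <span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
    <ul><li><ul><li><a class="tdme" href="orgues/achenhei.htm">&gt;
        St-Georges : Max ROETHINGER, 1962.</a></li></ul></li></ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def direct_text(tag):
    """Join the text nodes that are direct children of tag (hypothetical helper)."""
    # recursive=False keeps the search at the first level of children,
    # so text inside the nested <span>, <ul> and <a> is not collected.
    parts = tag.find_all(string=True, recursive=False)
    # Collapse indentation and newlines into single spaces.
    return " ".join(" ".join(parts).split())


# Start from each span.def and look at its parent <li>.
regions = [direct_text(span.parent) for span in soup.select("span.def")]
print(regions)
```

Compared with next_sibling, this does not depend on the text coming immediately after the span, but it would also pick up any other loose text placed directly in the same <li>.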

Example

from bs4 import BeautifulSoup

html = '''
<ul>
            <li>
               <span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
               <ul>
                  <li>
                     <ul>
                        <li>
                           <a class="tdme" href="orgues/achenhei.htm">&gt;
                                                St-Georges : Max ROETHINGER, 1962.</a>
                        </li>
                     </ul>
                  </li>
               </ul>
            </li>
            <li>
               <span class="def">Adamswiller</span> (Région de Drulingen, Bas-Rhin)<ul>
                  <li>
                     <ul>
                        <li>
                           <a class="tdme" href="orgues/adamswpr.htm">&gt;
                                                Eglise protestante : George WEGMANN, 1846.</a>
                        </li>
                     </ul>
                  </li>
               </ul>
            </li>              
</ul>
'''
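# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')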

for e in soup.select('ul span'):
    print(e.text)
    print(e.next_sibling.strip())
    print(' '.join(e.find_next('a').text.split()))

Output

Achenheim
(Région de Mundolsheim, Bas-Rhin)
> St-Georges : Max ROETHINGER, 1962.
Adamswiller
(Région de Drulingen, Bas-Rhin)
> Eglise protestante : George WEGMANN, 1846.
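If the real page is less regular than the sample, next_sibling may be None or another tag rather than a text node, and calling .strip() on it would then fail. The sketch below is one possible defensive variant, not part of the original answer. The function name extract_entries, the dictionary keys, the shortened markup and the third "Nowhere" entry are all assumptions added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical sample: two entries modelled on the question, plus an
# invented third entry with no region text and no link.
html = """
<ul>
  <li><span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
    <ul><li><ul><li><a class="tdme" href="orgues/achenhei.htm">&gt;
        St-Georges : Max ROETHINGER, 1962.</a></li></ul></li></ul>
  </li>
  <li><span class="def">Adamswiller</span> (Région de Drulingen, Bas-Rhin)<ul>
    <li><ul><li><a class="tdme" href="orgues/adamswpr.htm">&gt;
        Eglise protestante : George WEGMANN, 1846.</a></li></ul></li></ul>
  </li>
  <li><span class="def">Nowhere</span></li>
</ul>
"""


def extract_entries(markup):
    """Return one dict per span.def found in the markup (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    entries = []
    for span in soup.select("ul span.def"):
        sibling = span.next_sibling
        # Only strip when the sibling really is a text node.
        region = sibling.strip() if isinstance(sibling, NavigableString) else ""
        # Search inside the enclosing <li> so a missing link cannot be
        # filled in by the link of a later entry.
        link = span.parent.find("a")
        entries.append({
            "town": span.get_text(strip=True),
            "region": region,
            "organ": " ".join(link.get_text().split()) if link else "",
            "href": link.get("href", "") if link else "",
        })
    return entries


entries = extract_entries(html)
for entry in entries:
    print(entry)
```

Returning dictionaries instead of printing makes the result easier to filter or write to a file later. Whether empty strings are the right fallback for missing values depends on how the data is used afterwards.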

Answered By - HedgeHog

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
