Issue
I am using Python 3.8 with BeautifulSoup4. I am on Windows 10 and I use PyCharm.
I am kinda new to this lib, but I have managed simple extractions. However, I have this HTML code (which I didn't write and cannot edit):
<ul>
<li>
<span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/achenhei.htm">>
St-Georges : Max ROETHINGER, 1962.</a>
</li>
</ul>
</li>
</ul>
</li>
<li>
<span class="def">Adamswiller</span> (Région de Drulingen, Bas-Rhin)<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/adamswpr.htm">>
Eglise protestante : George WEGMANN, 1846.</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
So far, I was able to grab the values "Achenheim" (in the span tag), and "St-Georges : Max ROETHINGER, 1962." (in the a tag).
I would like to know if it's possible to grab the following values:
(Région de Mundolsheim, Bas-Rhin)
I am struggling because it's not really inside any specific tag, besides the li tag. But when I try to grab the text value of the li tag, I get this:
<li>
<span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/achenhei.htm">>
St-Georges : Max ROETHINGER, 1962.</a>
</li>
</ul>
</li>
</ul>
</li>
My code is this:
from bs4 import BeautifulSoup
import requests

r = requests.get('url_link')
soup = BeautifulSoup(r.content, 'html.parser')
regions = soup.select('ul li')
# Use a separate loop variable so the response object `r` is not overwritten
for region in regions:
    print(str(region))
It's not what I was expecting :( Does anyone know how to grab data that is outside a specific tag, please? Again, I am trying to get:
(Région de Mundolsheim, Bas-Rhin)
Trimming and slicing cannot be a solution in my case :/
Solution
You could change your selection strategy and select the <span> elements instead. The region text is a plain text node that directly follows each <span>, so next_sibling returns it:
for e in soup.select('ul span'):
    print(e.text)
    print(e.next_sibling.strip())
    print(' '.join(e.find_next('a').text.split()))
Example
from bs4 import BeautifulSoup
html = '''
<ul>
<li>
<span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/achenhei.htm">>
St-Georges : Max ROETHINGER, 1962.</a>
</li>
</ul>
</li>
</ul>
</li>
<li>
<span class="def">Adamswiller</span> (Région de Drulingen, Bas-Rhin)<ul>
<li>
<ul>
<li>
<a class="tdme" href="orgues/adamswpr.htm">>
Eglise protestante : George WEGMANN, 1846.</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
'''
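# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('ul span'):
    print(e.text)                                   # town name inside the <span>
    print(e.next_sibling.strip())                   # text node right after the <span>
    print(' '.join(e.find_next('a').text.split()))  # link text with whitespace collapsed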
Output
Achenheim
(Région de Mundolsheim, Bas-Rhin)
> St-Georges : Max ROETHINGER, 1962.
Adamswiller
(Région de Drulingen, Bas-Rhin)
> Eglise protestante : George WEGMANN, 1846.
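If some entries might lack the region text or the link, the loop above would fail or pick up the wrong element: next_sibling can be a tag (which has no usable strip()), and find_next('a') can reach into the following <li>. Below is a minimal sketch of a more defensive variant, not part of the original answer. The helper name extract_entries and the sample markup (whose second entry deliberately has no region text) are assumptions made for illustration. It also strips the stray ">" that the source HTML leaves at the start of each link text.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample: the second entry has no region text after its <span>.
sample_html = '''
<ul>
<li>
<span class="def">Achenheim</span> (Région de Mundolsheim, Bas-Rhin)
<ul><li><ul><li>
<a class="tdme" href="orgues/achenhei.htm">>
St-Georges : Max ROETHINGER, 1962.</a>
</li></ul></li></ul>
</li>
<li>
<span class="def">Adamswiller</span><ul><li><ul><li>
<a class="tdme" href="orgues/adamswpr.htm">>
Eglise protestante : George WEGMANN, 1846.</a>
</li></ul></li></ul>
</li>
</ul>
'''


def extract_entries(markup):
    """Return one dict per <span class="def">: town, region and organ text."""
    soup = BeautifulSoup(markup, 'html.parser')
    entries = []
    for span in soup.select('span.def'):
        # Only treat the next sibling as the region if it is a text node.
        sibling = span.next_sibling
        region = sibling.strip() if isinstance(sibling, NavigableString) else ''

        # Search for the link inside the enclosing <li> only, so a missing
        # link does not silently match the link of the next entry.
        item = span.find_parent('li')
        link = item.find('a') if item else None
        organ = ''
        if link:
            # Collapse whitespace, then drop the stray leading ">" and space.
            organ = ' '.join(link.get_text().split()).lstrip('> ')

        entries.append({
            'town': span.get_text(strip=True),
            'region': region,
            'organ': organ,
        })
    return entries


for entry in extract_entries(sample_html):
    print(entry)
```

Returning dictionaries instead of printing makes it easier to write the rows to a CSV file or a DataFrame later; an empty string marks a value that was not found.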
Answered By - HedgeHog