Issue
I am trying to fetch a specific group of li elements nested in a ul. Below is my starting code. The data I am trying to fetch is at https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html. I highlighted the block of li elements that I want to fetch.
> import requests
> from bs4 import BeautifulSoup
>
> page = requests.get('https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Risikogebiete_neu.html').text
>
> soup = BeautifulSoup(page, 'html.parser')
> # print(soup.prettify())
> uls = soup.find_all('ul', id=None)
> mine = []
> for ul in uls:
>     # Re-parse each ul on its own, then collect the text of every li in it
>     newsoup = BeautifulSoup(str(ul), 'html.parser')
>     lis = newsoup.find_all('li', id=None)
>     for li in lis:
>         mine.append(li.text)
>         print(li.text)
Solution
This works:
token = 'Gebiete, die zu einem beliebigen Zeitpunkt in den vergangenen 14 Tagen Risikogebiete waren, aber derzeit KEINE mehr sind:'
no_longer_at_risk = soup.find_all(text=token)[0].findNext('ul').find_all('li')
This requires that the text we're searching for doesn't change, not even slightly. You could make it more robust by searching for a regular expression instead:
import re
token = re.compile(r'vergangen.*Risikogebiet.*keine.*mehr', re.I)
no_longer_at_risk = soup.find_all(text=token)[-1].findNext('ul').find_all('li')
But fundamentally the best way would probably be to iterate over all text nodes in the document and pick the one that matches the most tokens from a list (e.g. ['Gebiet', 'Risikogebiet', 'vergangen', 'kein', 'mehr']).
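The token-counting idea could be sketched roughly as follows. This is only an illustration, not tested against the live page: the helper names `score`, `best_match` and `items_after_best_match` are made up for this example, and `soup` is assumed to be the BeautifulSoup object built in the question. `string=True` is the newer spelling of the `text=` argument used above.

```python
def score(text, tokens):
    """Count how many of the tokens occur in text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for token in tokens if token.lower() in lowered)


def best_match(strings, tokens):
    """Return the string containing the most tokens, or None if nothing matches.

    On a tie, the earliest string wins.
    """
    best, best_score = None, 0
    for candidate in strings:
        current = score(candidate, tokens)
        if current > best_score:
            best, best_score = candidate, current
    return best


def items_after_best_match(soup, tokens):
    """Return the li tags of the first ul after the best-matching text node.

    `soup` is assumed to be a BeautifulSoup object; an empty list is
    returned when no text node matches or no ul follows the match.
    """
    # find_all(string=True) yields every text node in the document
    heading = best_match(soup.find_all(string=True), tokens)
    if heading is None:
        return []
    ul = heading.find_next('ul')
    return ul.find_all('li') if ul is not None else []


tokens = ['Gebiet', 'Risikogebiet', 'vergangen', 'kein', 'mehr']
# no_longer_at_risk = items_after_best_match(soup, tokens)
```

Plain substring counting is crude: 'Gebiet' also matches inside 'Risikogebiet', and text inside script or style tags is scored like any other text node, so you may want to filter those out or weight the tokens differently.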
Answered By - Konrad Rudolph