Issue
I am trying to retrieve information from EZB press releases using BeautifulSoup. Since the HTML structure of the press releases changes over time, it is difficult to retrieve the date of each press release with a single selector. I therefore tried to use try/except as well as if/else statements to retrieve the date from all HTML files. Unfortunately, my code does not work the way I want it to, since I do not get the correct dates from all press releases.
Does anybody know how to iterate through multiple soup elements and choose the right element to select the date from the respective HTML file?
Here is my code:
from pandas.core.internals.managers import ensure_block_shape
import bs4, requests

pr_list = []

def parseContent(Urls):
    for x in Urls:
        res = requests.get(x)
        article = bs4.BeautifulSoup(res.text, 'html.parser')
        try:
            date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
            if date:
                for x in date:
                    date = x.text.strip()
            date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
            if date:
                for x in date:
                    date = x.text.strip()
            else:
                date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
                for x in date:
                    date = x.text.strip()
        except:
            date = None
        try:
            title = article.select('#main-wrapper > main > div.title > h1')
            for x in title:
                title = x.text.strip()
        except:
            title = None
        try:
            body = article.select("#main-wrapper > main > div.section")
            for x in body:
                body = x.text.strip()
        except:
            body = None
        row = [date,title,body]
        pr_list.append(row)
Solution
Store your match expressions in a list and then iterate over them until one is successful:
import bs4
import requests

date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]

body_expressions = [
    "#main-wrapper > main > div.section",
]


def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches or if we find
    multiple matches.
    """
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        # The loop finished without a break, so no expression matched.
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")

        date = try_several_expressions(article, date_expressions).text
        title = try_several_expressions(article, title_expressions).text
        body = try_several_expressions(article, body_expressions).text

        row = [date, title, body]
        pr_list.append(row)

    return pr_list
Assuming that you mean "ECB" rather than "EZB", I tested this against https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html and it seems to work as expected.
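For reference, a minimal way to reproduce that check yourself might look like the sketch below; the URL is the one above, and the exact output depends on the live page:

test_urls = [
    "https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html",
]

rows = parseContent(test_urls)
for date, title, body in rows:
    print(date)
    print(title)
    print(body[:200])  # print only the beginning of the body text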
If I make the one change I suggested in my comment (remove the if len(res) > 1 check), so that try_several_expressions looks like this:
def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches.
    """
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    # Always return the first matched element
    return res[0]
Then the script works for every single url in your list except for https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html, which doesn't have any content.
If you put a try/except block in parseContent, you can simply ignore that failure:
def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")

        try:
            date = try_several_expressions(article, date_expressions).text.strip()
            title = try_several_expressions(article, title_expressions).text.strip()
            body = try_several_expressions(article, body_expressions).text
        except ValueError:
            print(f'failed to parse: {url}')
            continue

        row = [date, title, body]
        pr_list.append(row)

    return pr_list
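For example, running it over a couple of the URLs from your list (including the empty one mentioned above) just reports the failure and keeps the rest; this is only a usage sketch, and the output depends on the live pages:

urls = [
    "https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html",
    "https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html",  # no content; gets skipped
]

rows = parseContent(urls)
print(f"parsed {len(rows)} of {len(urls)} press releases")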
Answered By - larsks