Issue
I am trying to retrieve information from EZB press releases using BeautifulSoup. Since the HTML structure of the press releases changes over time, it is difficult to retrieve the date of each press release with a single selector. I therefore tried to use try/except as well as if/else statements to retrieve the date from all HTML files. Unfortunately, my code does not work the way I want it to, since I do not get the correct dates from all press releases.
Does anybody know how to iterate through multiple soup elements and choose the right element to select the date from the respective HTML file?
Here is my code:
from pandas.core.internals.managers import ensure_block_shape
import bs4, requests

pr_list = []

def parseContent(Urls):
    for x in Urls:
        res = requests.get(x)
        article = bs4.BeautifulSoup(res.text, 'html.parser')
        try:
            date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
            if date:
                for x in date:
                    date = x.text.strip()
            date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
            if date:
                for x in date:
                    date = x.text.strip()
            else:
                date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
                for x in date:
                    date = x.text.strip()
        except:
            date = None
        try:
            title = article.select('#main-wrapper > main > div.title > h1')
            for x in title:
                title = x.text.strip()
        except:
            title = None
        try:
            body = article.select("#main-wrapper > main > div.section")
            for x in body:
                body = x.text.strip()
        except:
            body = None
        row = [date,title,body]
        pr_list.append(row)
Solution
Store your match expressions in a list and then iterate over them until one is successful:
import bs4
import requests

date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]

body_expressions = [
    "#main-wrapper > main > div.section",
]


def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches or if we find
    multiple matches.
    """
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        # The loop finished without a break, so no expression matched.
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")

        date = try_several_expressions(article, date_expressions).text
        title = try_several_expressions(article, title_expressions).text
        body = try_several_expressions(article, body_expressions).text

        row = [date, title, body]
        pr_list.append(row)

    return pr_list
Assuming that you mean "ECB" rather than "EZB", I tested this against https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html and it seems to work as expected.
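For reference, a minimal way to reproduce that check yourself might look like the sketch below; the URL is the one above, and the exact output depends on the live page:

test_urls = [
    "https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html",
]

rows = parseContent(test_urls)
for date, title, body in rows:
    print(date)
    print(title)
    print(body[:200])  # print only the beginning of the body text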
If I make the one change I suggested in my comment (remove the if len(res) > 1 check), so that try_several_expressions looks like this:
def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches.
    """
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    # Always return the first matched element
    return res[0]
Then the script works for every single url in your list except for https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html, which doesn't have any content.
If you put a try/except block in parseContent, you can simply ignore that failure:
def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")

        try:
            date = try_several_expressions(article, date_expressions).text.strip()
            title = try_several_expressions(article, title_expressions).text.strip()
            body = try_several_expressions(article, body_expressions).text
        except ValueError:
            print(f'failed to parse: {url}')
            continue

        row = [date, title, body]
        pr_list.append(row)

    return pr_list
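For example, running it over a couple of the URLs from your list (including the empty one mentioned above) just reports the failure and keeps the rest; this is only a usage sketch, and the output depends on the live pages:

urls = [
    "https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html",
    "https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html",  # no content; gets skipped
]

rows = parseContent(urls)
print(f"parsed {len(rows)} of {len(urls)} press releases")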
Answered By - larsks