Sunday, April 10, 2022

[FIXED] Use Beautiful Soup to unify #text after a tag

Labels: beautifulsoup, html, python, python-3.x, web-scraping

Issue

I'm using Beautiful Soup to put some information from a website into an Excel table.

The bold titles become the column headers, while the text after each colon goes into the rows.

What I'm doing is finding the bold label and taking its next_sibling:

  book_year = sibling.pre.find('b',text='Anno:').next_sibling.get_text().strip()

The problem is that in some cases the text after the colon is split into several #text nodes, so next_sibling returns only part of the information.

As you can see in the inspector, the content of "Titoli originali:" would be just "da" if I use next_sibling.
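A minimal reproduction may make the symptom easier to see. The HTML fragment below is made up for illustration (it is not copied from the real page), and it assumes the value is split because an inline tag such as <font> sits in the middle of it:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the value after the label is interrupted by a <font>
# tag, so it is stored as several sibling nodes rather than one text node.
html = (
    "<pre><b>Titoli originali:</b> da <font>Titolo A</font> e Titolo B\n"
    "<b>Anno:</b> 1934</pre>"
)

soup = BeautifulSoup(html, "html.parser")
label = soup.pre.find("b", string="Titoli originali:")

# next_sibling is only the first node after the label, i.e. the text " da ".
first_piece = str(label.next_sibling).strip()

# The rest of the value lives in the following siblings.
sibling_count = len(list(label.next_siblings))

print(repr(first_piece), sibling_count)
```

Here next_sibling stops at the first text node, which is why only the leading fragment comes back.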

Is there a way to unify all those #text parts? How would you approach this problem? Thank you

UPDATES:

This is the website I'm scraping from --> http://www.letteraturenordiche.it/danimarca.htm

It's giving me a hard time because it has an inconsistent structure and doesn't use classes.

One thing I did was strip tags such as <font> and <span> from the <pre> content, leaving only the <b> tags, and then take the text after each of those.
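The idea described above (drop the wrapper tags, keep the <b> labels, read whatever follows each label) could be sketched roughly as follows. The fragment and the helper name text_after are hypothetical, and walking the siblings with itertools.takewhile is just one possible way to do it, not necessarily what was used on the real page:

```python
from itertools import takewhile

from bs4 import BeautifulSoup, Tag

# Hypothetical fragment, not taken from the real site.
html = (
    "<pre><b>Titoli originali:</b> da <font>Titolo A</font> e <span>Titolo B</span>\n"
    "<b>Anno:</b> 1934\n</pre>"
)

soup = BeautifulSoup(html, "html.parser")

# Drop the wrapper tags but keep their text, so only <b> labels remain as tags.
for tag in soup.select("font, span"):
    tag.unwrap()


def text_after(label):
    """Join every node that follows `label`, stopping at the next <b> tag."""

    def is_not_bold(node):
        return not (isinstance(node, Tag) and node.name == "b")

    parts = takewhile(is_not_bold, label.next_siblings)
    raw = "".join(p.get_text() if isinstance(p, Tag) else str(p) for p in parts)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(raw.split())


fields = {
    b.get_text(strip=True).rstrip(":"): text_after(b)
    for b in soup.pre.find_all("b")
}
print(fields)
```

This leaves the tree otherwise untouched and yields one value per bold label, however many text nodes the value was split into.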

Solution

Parsing this document isn't pretty. It was probably hand-written in Word and then exported to HTML:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "http://www.letteraturenordiche.it/danimarca.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# preprocess the document:

# remove whitespace-only text nodes:
for w in soup.find_all(text=True):
    if not w.strip():
        w.extract()

# unwrap unnecessary tags (keeps their text, drops the markup):
for t in soup.select("i, font, span"):
    t.unwrap()

# merge adjacent NavigableStrings into single text nodes:
soup.smooth()

data = []
for t in soup.select("table"):
    title = t.p.get_text(separator=" ", strip=True)
    year = (
        t.select_one('b:-soup-contains("Anno:")')
        .find_next_sibling(text=True)
        .strip()
    )
    author = (
        t.find_previous("hr", attrs={"size": "6"})
        .find_previous("p")
        .get_text(strip=True)
    )
    editor = (
        t.select_one('b:-soup-contains("Editore:")')
        .find_next_sibling(text=True)
        .strip()
    )
    pages = (
        t.select_one('b:-soup-contains("Pagine:")')
        .find_next_sibling(text=True)
        .strip()
    )
    notes = (
        t.select_one('b:-soup-contains("Note:", "Comprende")')
        .find_next_sibling(text=True)
        .strip()
    )
    original_title = t.select_one(
        'b:-soup-contains("Titolo Original", "Titolo original", "Titoli originali")'
    )

    if not original_title:
        # fallback: look for a tag whose whole text is just ":"
        original_title = t.find(lambda tag: tag.text.strip() == ":")

    if not original_title:
        original_title = ""
    else:
        original_title = original_title.find_next_sibling(text=True).strip()

    data.append((title, year, author, editor, pages, notes, original_title))

df = pd.DataFrame(
    data,
    columns=[
        "title",
        "year",
        "author",
        "editor",
        "pages",
        "notes",
        "original_title",
    ],
)
df["title"] = df["title"].str.replace(r"\r?\n", " ", regex=True)
df["author"] = df["author"].str.replace(r"\r?\n", " ", regex=True)
print(df)
df.to_csv("data.csv", index=False)

This creates the DataFrame and saves it as data.csv, which can then be opened in a spreadsheet application such as LibreOffice.
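To see why the unwrap() and smooth() steps matter, here is a small offline sketch on a made-up fragment (not the real page). It uses string=True, the newer spelling of the text=True argument used above, and assumes a Beautiful Soup version recent enough to provide smooth() and the :-soup-contains selector:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the value after the label is split by <i> and <font>.
html = (
    "<table><tr><td><pre>"
    "<b>Titoli originali:</b> da <i>Titolo A</i> e <font>Titolo B</font>\n"
    "<b>Anno:</b> 1934"
    "</pre></td></tr></table>"
)

soup = BeautifulSoup(html, "html.parser")
label = soup.select_one('b:-soup-contains("Titoli originali")')

# Before preprocessing, the next text sibling is only the first fragment.
before = label.find_next_sibling(string=True).strip()

# Unwrapping leaves the inner text in place as separate, adjacent text nodes...
for tag in soup.select("i, font, span"):
    tag.unwrap()

# ...and smooth() merges those adjacent text nodes into one.
soup.smooth()

after = label.find_next_sibling(string=True).strip()

print(repr(before))
print(repr(after))
```

After both steps the whole value sits in a single text node, so one find_next_sibling call is enough to read it.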

Answered By - Andrej Kesely

This answer was collected from Stack Overflow and tested by the PythonFixing community admins; it is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
