Friday, August 5, 2022

[FIXED] How to get all attributes of section of text from a html string in Python?

Labels: beautifulsoup, css, html, python

Issue

I have an HTML string:

Normal<span style="font-weight: bold;">Bold <span style="font-style: italic;">BoldAndItalic</span></span><span style="font-style: italic;">Italic</span>

Which renders out to

NormalBold BoldAndItalicItalic

What I want is a Python list of dictionaries that records the style attributes applied to each piece of text, like this:

[
 {"text":"Normal","styles":{"color":None,"font-style":None,"font-weight":None}},
 {"text":"Bold","styles":{"color":None,"font-style":None,"font-weight":"bold"}},
 {"text":" ","styles":{"color":None,"font-style":None,"font-weight":None}},
 {"text":"BoldAndItalic","styles":{"color":None,"font-style":"italic","font-weight":"bold"}},
 {"text":"Italic","styles":{"color":None,"font-style":"italic","font-weight":None}}
]

Where None could be considered default/not given.

However, when I parse the HTML string with the BeautifulSoup library, I cannot find a way to access each section of text individually. I have to go through the span tags first, and because span tags can be nested inside other span tags, writing a parser myself becomes very difficult. I have not been able to figure it out.

What I attempted was this:

def stylesandtext(obj):
    if obj.decomposed:
        return
    text = obj.text
    styles={"color":None,"font-weight":None}
    stylestr = obj.attrs['style'].split(": ")
    styles[stylestr[0]] = stylestr[1].replace(";","")
    if obj.find('span') !=None:
        getsecstyle = stylesandtext(obj.find('span'))['styles']
        if getsecstyle['color'] !=None:
            styles['color'] = getsecstyle['color']
        if getsecstyle['font-weight'] !=None:
            styles['font-weight'] = getsecstyle['font-weight']
        obj.find('span').decompose()
    return {"text":text,"styles":styles}

Using BeautifulSoup, for every span tag I checked for an inner span tag, read its attributes, and merged them into the dict that stores the result. It partly worked, but it did not pick up any text outside the span tags.

How would I go about getting the style attributes that apply to each section of text?

A few more details:

There are no tags other than span.
I only need to account for a few style properties: color, font-weight, font-style, text-decoration.

Solution

Every tag has an attribute called children, which iterates over the nodes directly inside it (for example soup.children or soup.span.children).
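As a small illustration of what children yields, the sketch below parses a shortened version of the string from the question and records each direct child of the soup. The variable names (html, kinds) are only illustrative, and the built-in "html.parser" backend is an assumption; the original post does not say which parser was used.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened version of the HTML string from the question.
html = ('Normal<span style="font-weight: bold;">Bold '
        '<span style="font-style: italic;">BoldAndItalic</span></span>')
soup = BeautifulSoup(html, "html.parser")

# .children iterates over direct children only. The nested span is not
# listed here; it is reached by iterating the outer span's children.
kinds = []
for child in soup.children:
    if isinstance(child, NavigableString):
        kinds.append(("text", str(child)))
    elif isinstance(child, Tag):
        kinds.append(("tag", child.name))

print(kinds)
```

Text that sits outside any span appears as a NavigableString, while each top-level span appears as a Tag, which is why a recursive walk over children can reach every piece of text.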

I can then write a recursive function that collects the attributes and text of every element and stores them in a list.

This is the code I figured out:

import bs4

def get_as_list(obj, extstyle=None):
    alldata = []

    # Start from the inherited styles. Copy them so that this tag's own
    # declarations do not leak back into its parent or its siblings.
    style = {"color": None, "font-weight": None, "font-style": None, "text-decoration": None}
    if extstyle is not None:
        style = dict(extstyle)
    if 'style' in obj.attrs:
        # obj.attrs is like {'style': 'color: #55FF55; font-weight: bold;'}
        for declaration in obj.attrs['style'].split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                style[name.strip()] = value.strip()

    for x in obj.children:
        if isinstance(x, bs4.element.NavigableString):
            # Give every piece of text its own copy of the current styles.
            alldata.append({'text': str(x), 'styles': dict(style)})
        else:
            alldata.extend(get_as_list(x, style))
    return alldata

I call this from an outer function for every element in soup.children, excluding NavigableStrings.
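For comparison, here is a sketch of a different technique, not the one used in the answer: instead of recursing down through children, it visits every text node and walks up through its parents to collect inherited styles. The names TRACKED, parse_style and styled_text_runs are hypothetical helpers introduced for illustration, the "html.parser" backend is an assumption, and the input assumes the string from the question ends with a closing span tag.

```python
from bs4 import BeautifulSoup

# The four properties the question cares about.
TRACKED = ("color", "font-weight", "font-style", "text-decoration")

def parse_style(style_attr):
    """Turn 'a: b; c: d;' into {'a': 'b', 'c': 'd'}."""
    result = {}
    for declaration in style_attr.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            result[name.strip()] = value.strip()
    return result

def styled_text_runs(html):
    soup = BeautifulSoup(html, "html.parser")
    runs = []
    # find_all(string=True) returns every text node in document order.
    for text in soup.find_all(string=True):
        styles = dict.fromkeys(TRACKED)  # every property starts as None
        # .parents goes from the innermost ancestor outwards, so a
        # property is only filled while it is still None. That way the
        # innermost declaration wins.
        for parent in text.parents:
            declared = parse_style(parent.get("style", ""))
            for name in TRACKED:
                if styles[name] is None and name in declared:
                    styles[name] = declared[name]
        runs.append({"text": str(text), "styles": styles})
    return runs

html = ('Normal<span style="font-weight: bold;">Bold '
        '<span style="font-style: italic;">BoldAndItalic</span></span>'
        '<span style="font-style: italic;">Italic</span>')
runs = styled_text_runs(html)
for run in runs:
    print(run)
```

This version also covers top-level text such as "Normal" without a separate outer loop. Note that it keeps "Bold " as a single piece with its trailing space, rather than splitting the space into its own entry as in the desired output shown in the question.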

Answered By - Tejasisamazing

This answer was collected from Stack Overflow, tested by PythonFixing community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
