Issue
I have an HTML string:
Normal<span style="font-weight: bold;">Bold <span style="font-style: italic;">BoldAndItalic</span></span><span style="font-style: italic;">Italic</spa
Which renders out to
NormalBold BoldAndItalicItalic
What I want to get is a Python list of dictionaries that records the style attributes applied to each piece of text, something like this:
[
{"text":"Normal","styles":{"color":None,"font-style":None,"font-weight":None}},
{"text":"Bold","styles":{"color":None,"font-style":None,"font-weight":"bold"}},
{"text":" ","styles":{"color":None,"font-style":None,"font-weight":None}},
{"text":"BoldAndItalic","styles":{"color":None,"font-style":"italic","font-weight":"bold"}},
{"text":"Italic","styles":{"color":None,"font-style":"italic","font-weight":None}}
]
Where None could be considered default/not given.
However, when I parse the HTML string with the BeautifulSoup library, I cannot find a way to access each section of text individually. I have to go through the span tags first, and since span tags can be nested inside other span tags, writing a parser myself becomes very difficult, and I cannot figure it out.
What I attempted was this:
def stylesandtext(obj):
    if obj.decomposed:
        return
    text = obj.text
    styles = {"color": None, "font-weight": None}
    stylestr = obj.attrs['style'].split(": ")
    styles[stylestr[0]] = stylestr[1].replace(";", "")
    if obj.find('span') != None:
        getsecstyle = stylesandtext(obj.find('span'))['styles']
        if getsecstyle['color'] != None:
            styles['color'] = getsecstyle['color']
        if getsecstyle['font-weight'] != None:
            styles['font-weight'] = getsecstyle['font-weight']
        obj.find('span').decompose()
    return {"text": text, "styles": styles}
Using BeautifulSoup, for every span tag I checked for inner span tags, read their attributes, and merged them into the dict that stores the result. It kind of worked, but it did not pick up any text outside the span tags.
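To illustrate why a span-only approach misses text, here is a small sketch. It assumes the truncated sample string above ends in `</span>` and uses the built-in "html.parser" backend; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: the sample string from the question, completed with "</span>".
html = ('Normal<span style="font-weight: bold;">Bold '
        '<span style="font-style: italic;">BoldAndItalic</span></span>'
        '<span style="font-style: italic;">Italic</span>')

soup = BeautifulSoup(html, "html.parser")

# find_all("span") only returns tags, and .text flattens nested spans,
# so the outer span's text already contains the inner span's text.
span_texts = [span.text for span in soup.find_all("span")]
print(span_texts)
```

The bare "Normal" at the start is not inside any span, so it never shows up in this list, and the outer span's text overlaps with the inner one.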
How would I go about getting the attributes that apply to each section of text?
A few more details:
- There are no tags other than span
- I only need to account for a few style attributes: color, font-weight, font-style, text-decoration
Solution
Every tag has an attribute called children, which iterates over all the elements directly inside it (like soup.children or soup.span.children).
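As a rough illustration of what children yields, the sketch below lists the direct children of the soup and of the first span. It assumes the sample string from the question ends in `</span>` and uses the "html.parser" backend; `describe_children` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: the sample string from the question, completed with "</span>".
html = ('Normal<span style="font-weight: bold;">Bold '
        '<span style="font-style: italic;">BoldAndItalic</span></span>'
        '<span style="font-style: italic;">Italic</span>')

soup = BeautifulSoup(html, "html.parser")

def describe_children(tag):
    """Return (kind, value) pairs for the direct children of a tag."""
    described = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            # Plain text sitting directly inside this tag.
            described.append(("text", str(child)))
        else:
            # A nested tag; report its raw style attribute (None if absent).
            described.append(("tag", child.get("style")))
    return described

top_level = describe_children(soup)
inside_bold = describe_children(soup.span)
print(top_level)
print(inside_bold)
```

Only direct children are listed: the nested italic span appears under the bold span, not at the top level, which is what makes a recursive walk a natural fit.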
Then, in a recursive function, I can collect all the attributes and text, which I store in a list.
This is the code I figured out:
import bs4

def get_as_list(obj, extstyle=None):
    alldata = []
    # Copy the inherited styles so changes made here do not leak into siblings or parents.
    if extstyle is not None:
        style = dict(extstyle)
    else:
        style = {"color": None, "font-weight": None, "font-style": None, "text-decoration": None}
    if 'style' in obj.attrs:
        # obj.attrs is like {'style': 'color: #55FF55;'}
        for declaration in obj.attrs['style'].split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                style[name.strip()] = value.strip()
    for x in obj.children:
        if isinstance(x, bs4.element.NavigableString):
            # Each text piece gets its own copy of the current styles.
            alldata.append({'text': str(x), 'styles': dict(style)})
        else:
            alldata.extend(get_as_list(x, style))
    return alldata
I call this from an external function, for every element in soup.children, excluding NavigableStrings.
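The answer does not show the external function, so the following is only a self-contained sketch of how the whole approach might fit together. The names `styled_pieces`, `walk`, `parse_declarations` and `DEFAULT_STYLES` are hypothetical, the sample string is assumed to end in `</span>`, and unlike the description above this version keeps top-level text (giving it the default styles) so that "Normal" is not dropped.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

DEFAULT_STYLES = {"color": None, "font-weight": None,
                  "font-style": None, "text-decoration": None}

def parse_declarations(style_attr):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'}."""
    parsed = {}
    for declaration in style_attr.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            parsed[name.strip()] = value.strip()
    return parsed

def walk(node, inherited):
    """Yield one {'text', 'styles'} dict per text node under `node`."""
    # Start from a copy of the parent's styles, then apply this tag's own.
    styles = dict(inherited)
    styles.update(parse_declarations(node.attrs.get("style", "")))
    for child in node.children:
        if isinstance(child, NavigableString):
            yield {"text": str(child), "styles": dict(styles)}
        else:
            yield from walk(child, styles)

def styled_pieces(html):
    soup = BeautifulSoup(html, "html.parser")
    pieces = []
    for element in soup.children:
        if isinstance(element, NavigableString):
            # Assumption: top-level text is kept and gets the default styles.
            pieces.append({"text": str(element), "styles": dict(DEFAULT_STYLES)})
        else:
            pieces.extend(walk(element, DEFAULT_STYLES))
    return pieces

# Assumption: the sample string from the question, completed with "</span>".
html = ('Normal<span style="font-weight: bold;">Bold '
        '<span style="font-style: italic;">BoldAndItalic</span></span>'
        '<span style="font-style: italic;">Italic</span>')

pieces = styled_pieces(html)
for piece in pieces:
    print(piece)
```

Note that this yields "Bold " (with its trailing space) as a single piece rather than splitting off the space as in the desired output from the question; splitting whitespace into separate entries would need an extra post-processing step.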
Answered By - Tejasisamazing