Wednesday, July 20, 2022

[FIXED] Python - Beautifulsoup, differentiate parsed text inside of an html element by using internal tags

Labels: beautifulsoup, html, html-parsing, parsing, python-3.x

Issue

So, I'm working on an HTML parser to extract some text data from a list of elements and format it before producing the output. I have a title that I need to set in bold, and a description which I'll leave as it is. I got stuck when I reached this situation:

<div class ="Content">
  <Strong>Title:</strong>
  description
</div>

As you can see, the strings are already formatted, but I can't find a way to get the tags and the text out together. My script looks roughly like this:

article = ""  # all the formatted text is stored here; I need it as one long string before I output it
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
  x = 0
  for i in result.find("strong"):
    if x == 0:
      temp1 = "<strong>" + i.text + "</strong>"
      article += temp1
      x = 1
    else:
      temp2 = i.nextSibling  # I know this is wrong
      article += temp2
      x = 0
print(article)

It throws an AttributeError, but the message seems wrong to me, since it says "Did you call find_all() when you meant to call find()?".

I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...

What I need to get is: "Title: description"

Thanks in advance for any response.

I'm sorry if I haven't explained clearly what I'm trying to accomplish; it's a bit complicated. I actually need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, but the text has to be formatted in a certain way before I upload it. In this case I need to take the text inside the tags and format it, do the same with the description, and print them one after the other. For example, a request could look like:

http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>

So that the result is:

Title1: description

So what I need to do is differentiate between the text inside the tag and the text outside it, using the tag itself as a reference.

Solution

EDIT

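To select the <strong> element, use:

strong = soup.select_one('div.Content strong')

and then to get the text node that follows it, use next_sibling (nextSibling is the older spelling of the same attribute):

strong.next_sibling

You may need to strip it to get rid of the surrounding whitespace:

strong.next_sibling.strip()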
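Put together, the steps above might look like the following minimal sketch. The helper name format_block is hypothetical, the HTML is the snippet from the question with its attribute spacing tidied, and the code assumes each div.Content holds one <strong> followed directly by a text node.

```python
from bs4 import BeautifulSoup

html = '''
<div class="Content">
  <strong>Title:</strong>
  description
</div>
'''


def format_block(div):
    """Return '<strong>Title:</strong> description' for one div (hypothetical helper)."""
    strong = div.select_one('strong')
    if strong is None:
        # No <strong> inside this div: fall back to its plain text.
        return div.get_text(strip=True)
    title = strong.get_text(strip=True)
    sibling = strong.next_sibling
    # The node after <strong> is expected to be a text node; a tag or nothing gives ''.
    description = sibling.strip() if isinstance(sibling, str) else ''
    return f'<strong>{title}</strong> {description}'


soup = BeautifulSoup(html, 'html.parser')
formatted = format_block(soup.select_one('div.Content'))
print(formatted)
```

With this input the script prints <strong>Title:</strong> description, which keeps the tag around the title and leaves the description as plain text.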

Just in case

You can use ANSI escape sequences to print something in bold in a terminal, but I am not sure why you would want to do that. That is something that should be clarified in your question.

Example

from bs4 import BeautifulSoup

html='''
<div class ="Content">
  <Strong>Title:</strong>
  description
</div>
'''
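soup = BeautifulSoup(html, 'html.parser')
# get_text(strip=True) yields 'Title:description'; split once on the first colon
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':', 1)

# \033[1m switches bold on and \033[0m resets it (ANSI escape sequences)
print('\033[1m' + text[0] + ': \033[0m' + text[1])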

Output

Title: description
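The AttributeError in the question comes from calling .find() on the list that findAll() returns, so with several blocks it may be easier to loop over that list. The sketch below does this and then percent-encodes the result for a request. The two-block HTML, the <ul>/<li> wrapping and the variable names are assumptions for illustration, not the asker's real data.

```python
from urllib.parse import quote

from bs4 import BeautifulSoup

# Hypothetical input with two blocks, modelled on the snippet in the question.
html = '''
<div class="Content"><strong>Title1:</strong> first description</div>
<div class="Content"><strong>Title2:</strong> second description</div>
'''

soup = BeautifulSoup(html, 'html.parser')

items = []
# find_all() returns a list-like ResultSet, so loop over it instead of calling .find() on it.
for div in soup.find_all('div', class_='Content'):
    strong = div.find('strong')
    if strong is None:
        continue  # skip blocks without a title
    rest = strong.next_sibling
    description = rest.strip() if isinstance(rest, str) else ''
    items.append(f'<li><strong>{strong.get_text(strip=True)}</strong> {description}</li>')

# One long string, as the question requires.
article = '<ul>' + ''.join(items) + '</ul>'

# Percent-encode the markup so it can travel as a single query or form value.
encoded = quote(article, safe='')
print(article)
print(encoded)
```

Passing safe='' makes quote() encode "/" as well, so closing tags such as </li> become %3C%2Fli%3E.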

Answered By - HedgeHog

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
