Issue
So, I'm working on an HTML parser to extract some text data from a list of and format it before giving an output. I have a title that I need to set as bold, and a description which I'll leave as is. I got stuck when I reached this situation:
<div class ="Content">
<Strong>Title:</strong>
description
</div>
As you can see, the strings are already formatted, but I can't find a way to get the tags and the text out together. My script looks roughly like this:
article = ""  # all the formatted text is stored here; I need it as one long string before I output
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
    x = 0
    for i in result.find("strong"):
        if x == 0:
            temp1 = "<strong>" + i.text + "</strong>"
            article += temp1
            x = 1
        else:
            temp2 = i.nextSibling  # I know this is wrong
            article += temp2
            x = 0
print(article)
It throws an AttributeError, but the message looks misleading to me, since it says "Did you call find_all() when you meant to call find()?".
I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve.
What I need to get is: "Title: description"
Thanks in advance for any response.
I'm sorry if I couldn't explain well what I'm trying to accomplish, but it's somewhat complicated. I actually need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, but the text has to be formatted in a certain way before uploading. In this case I would need to get the element inside the tags and format it, then do the same with the description and print them one after the other. For example, a request could look like:
http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>
So that the result is:
- Title1: description
So what I need to do is differentiate between the element inside the tag and the one outside it, using the tag itself as a reference.
Solution
EDIT
To select the <strong> use:
strong = soup.select_one('div.Content strong')
and then to select its nextSibling:
strong.nextSibling
You may need to strip it to get rid of whitespace:
strong.nextSibling.strip()
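Put together, the two steps above might look like the following sketch. It assumes the sample markup from the question and builds the tagged string the question asks for; the variable names `description` and `article` are only illustrative, and `next_sibling` is the current Beautiful Soup 4 spelling of `nextSibling`.

```python
from bs4 import BeautifulSoup

html = '''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# html.parser lowercases tag names, so <Strong> is matched by 'strong'
strong = soup.select_one('div.Content strong')

# The text node right after </strong> holds the description
# surrounded by newlines, so strip it
description = strong.next_sibling.strip()

# Re-wrap the title in its tag and append the plain description
article = '<strong>' + strong.get_text(strip=True) + '</strong> ' + description
print(article)  # <strong>Title:</strong> description
```

Note that `select_one` returns `None` when nothing matches, so real input would need a check before accessing `next_sibling`.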
Just in case: you can use ANSI escape sequences to print something in bold, but I am not sure why you would do that. That is something that should be clarified in your question.
Example
from bs4 import BeautifulSoup
html='''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html,'html.parser')
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
print('\033[1m'+text[0]+': \033[0m'+ text[1])
Output
Title: description
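If the goal is the HTML string for the request described in the question rather than bold terminal output, something along these lines may be closer. This is only a sketch under assumptions: the function name `build_list`, the two-item sample markup, and the `<ul>`/`<li>` wrapping are made up for illustration, and the actual upload step is left out because the server's expectations are not known.

```python
from urllib.parse import quote
from bs4 import BeautifulSoup

# Hypothetical input with more than one matching div
html = '''
<div class="Content"><strong>Title1:</strong> first description</div>
<div class="Content"><strong>Title2:</strong> second description</div>
'''

def build_list(markup):
    """Return a <ul> string with one <li> per div.Content (illustrative helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    items = []
    # find_all returns a list-like ResultSet, so loop over it
    # instead of calling .find() on the result directly
    for div in soup.find_all('div', class_='Content'):
        strong = div.find('strong')
        if strong is None:
            continue  # skip divs without a title
        sibling = strong.next_sibling
        # next_sibling may be None or another tag; only use it when it is text
        description = sibling.strip() if isinstance(sibling, str) else ''
        title = strong.get_text(strip=True)
        items.append('<li><strong>' + title + '</strong> ' + description + '</li>')
    return '<ul>' + ''.join(items) + '</ul>'

article = build_list(html)

# Percent-encode everything so the string is safe as a query or form value
encoded = quote(article, safe='')
print(article)
print(encoded)
```

Looping over the `find_all` result is also what avoids the AttributeError from the question, which comes from calling `.find()` on the whole result set rather than on a single element.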
Answered By - HedgeHog