Issue
ANSWER FOUND, see the bottom of this message
I use BeautifulSoup to parse HTML content, and while inspecting a Tag object created by soup I encountered a very strange string that looks like the image below. This is from the 'inspecting variables' view in my IDE, PyCharm.
The real HTML looks like below:
When I call Tag.getText().strip(), those 'NBSP' are not removed. Essentially all I want is just the text 'Take action for animals suffering now \n.'. But those NBSP do not seem to be entities, because I'd assume soup would have converted them to spaces? What are those NBSPs and how do I get rid of them?
Thanks
Solution
From the question itself:
EDIT ANSWER FOUND
I tried Claude and it gave me the right answer. I am posting it below; hopefully it is useful for others.
In short, these 'NBSP' (I also had ZWNJ in another string) are Unicode characters that show up as whitespace after parsing HTML: NBSP is the no-break space (U+00A0) and ZWNJ is the zero-width non-joiner (U+200C). They are NOT HTML entities, so soup does not parse them. One possible reason they appear in your HTML is that the authors copied content from Word and inherited its whitespace characters, which is not good practice.
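To check which characters are actually in a string, one option is to list every non-ASCII character with its code point and Unicode name. This is only a minimal sketch using the standard library; the helper name describe_non_ascii and the sample string are made up for illustration and do not come from the original page.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# Hypothetical sample containing a no-break space and a zero-width non-joiner
sample = "Take\xa0action\u200c now"
for ch, code_point, name in describe_non_ascii(sample):
    print(code_point, name)
```

Running this on the text returned by getText() should make it clear which code points need to be replaced or removed.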
To get rid of such content, do
import string
text = tagobject.getText()  # 'tagobject' is whichever soup tag object you want to process
text = ''.join(filter(lambda x: x in string.printable, text))
Alternatively, if you know the Unicode code point of the character you want to get rid of, you can replace it directly (str.replace returns a new string, so assign the result):
text = text.replace(u'\xa0', ' ')
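One caveat: the string.printable filter drops every non-ASCII character, including legitimate ones such as accented letters. If that matters, a more targeted cleanup might look like the sketch below. The helper names clean_text and ascii_only and the sample string are assumptions for illustration; the sample stands in for whatever tagobject.getText() returns.

```python
import string


def clean_text(text):
    """Replace no-break spaces with ordinary spaces and drop zero-width non-joiners."""
    text = text.replace("\xa0", " ")   # NBSP -> regular space
    text = text.replace("\u200c", "")  # ZWNJ is invisible, so remove it entirely
    # Collapse runs of whitespace (including newlines) and trim both ends
    return " ".join(text.split())


def ascii_only(text):
    """The string.printable approach: keeps only printable ASCII characters."""
    return "".join(ch for ch in text if ch in string.printable)


# Hypothetical sample standing in for the result of tagobject.getText()
raw = "\xa0Caf\u00e9\xa0\xa0menu\u200c \n"

print(repr(clean_text(raw)))
print(repr(ascii_only(raw)))
```

With this sample, clean_text keeps the accented letter while ascii_only silently removes it along with the NBSP and ZWNJ characters. Note that clean_text also collapses newlines into single spaces, which may or may not be what you want.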
Answered By - Driftr95