Thursday, November 23, 2023

[FIXED] BeautifulSoup parsed HTML: strange NBSP string, how to get rid of them?


Issue

ANSWER FOUND, see the bottom of this message

I use BeautifulSoup to parse HTML content, and while inspecting a Tag object created by soup, I encountered a very strange string that looks like the image below. This is the 'inspecting variables' view in my IDE, PyCharm.

The real HTML looks like below:

When I call Tag.getText().strip(), those 'NBSP' are not removed. Essentially, all I want is the text 'Take action for animals suffering now \n.'. But those NBSP do not seem to be entities, because I'd assume soup would have converted them to spaces. What are those NBSPs, and how do I get rid of them?

Thanks

EDIT: ANSWER FOUND. I tried Claude and it gave me the right answer. I am posting it below in the hope that it is useful to others.

In short, these 'NBSP' (I also had ZWNJ in another) are Unicode characters, namely the no-break space (U+00A0) and the zero-width non-joiner (U+200C), that show up as whitespace after parsing HTML. They are NOT HTML entities, so soup does not convert them to ordinary spaces. One possible reason they appear in your HTML is that the authors copied content from Word and inherited its whitespace characters, which is not good practice.
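If you are not sure which invisible characters you are dealing with, a quick check like the one below can help. It is only a sketch using the standard library's unicodedata module; the helper name describe_invisibles and the sample string are made up for illustration and are not part of the original answer.

```python
import unicodedata


def describe_invisibles(text):
    """Return (char, code point, Unicode name) for every non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127  # anything outside plain ASCII
    ]


# Hypothetical sample resembling the text in the question
sample = "Take\xa0action\u200c now"
for ch, code_point, name in describe_invisibles(sample):
    print(repr(ch), code_point, name)
```

Once you know the code points, you can remove exactly those characters instead of guessing.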

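To get rid of such content, do

import string
text = tagobject.getText()  # 'tagobject' is whichever soup Tag object you want to process
# Keep only characters in string.printable (ASCII only), so any other non-ASCII text, such as accented letters, is dropped too
text = ''.join(ch for ch in text if ch in string.printable)

Alternatively, if you know the Unicode code point of the character you want to get rid of, you can use text = text.replace('\xa0', ' '). Note that str.replace returns a new string, so the result has to be assigned back.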
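A more targeted variant of the same idea, as a rough sketch: replace NBSP with a normal space and delete a few zero-width characters, while leaving other non-ASCII text alone. The function name clean_text and the exact set of characters in ZERO_WIDTH are my own assumptions, so adjust them to whatever actually shows up in your data.

```python
import re

# Assumption: zero-width characters worth deleting (ZWSP, ZWNJ, ZWJ, word joiner, BOM)
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def clean_text(text):
    """Turn NBSP into a plain space, drop zero-width characters, tidy spacing."""
    text = text.replace("\xa0", " ")     # NBSP -> ordinary space
    text = text.translate(ZERO_WIDTH)    # code points mapped to None are deleted
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces/tabs, keep newlines
    return text.strip()


# Hypothetical input; in practice this would be tagobject.getText()
raw = "\xa0Take\xa0action\u200c for animals \xa0 caf\u00e9\xa0"
print(clean_text(raw))
```

Unlike the string.printable filter, this keeps legitimate non-ASCII characters such as accented letters.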

Solution

From the question itself: the answer is the EDIT at the end of the Issue section above.

Answered By - Driftr95

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
