Issue
I'm getting UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined> while running this code:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://stackoverflow.com').text
soup = BeautifulSoup(r, 'lxml')
print(soup.prettify())
and the output is:
Traceback (most recent call last):
  File "c:\Users\Asus\Documents\Hello World\Web Scraping\st.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Users\Asus\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 756: character maps to <undefined>
I'm using Python 3.8.1 and UTF-8 in VS Code. How can I solve this?
Solution
There are hints in the full error message. I will keep here only what seems most important:
Traceback ...
File "...\cp1252.py", ...
UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' ...
The error is caused by the print call. Somewhere in your text there is a ZERO WIDTH SPACE character (Unicode U+200B). When you print to a Windows console, the string is internally encoded into the Windows console code page (cp1252 here), and ZERO WIDTH SPACE cannot be represented in that code page. By the way, the default console on Windows is not really Unicode friendly.
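The failure can be reproduced without a console or a web request. This is a minimal sketch: the sample string is made up for illustration, and cp1252 is assumed to be the code page in play, as in the traceback above.

```python
import sys

# Hypothetical sample text containing a ZERO WIDTH SPACE (U+200B) at index 4.
text = "zero\u200bwidth"

# The encoding print() will use for this stream; on the asker's setup
# this was apparently cp1252.
print(sys.stdout.encoding)

err = None
try:
    # Same step that print() performs implicitly on a cp1252 stream.
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    err = exc  # keep a reference; 'exc' is cleared when the block ends

if err is not None:
    # err.start is the index of the first character that could not be encoded.
    print("cannot encode", repr(err.object[err.start]), "at index", err.start)
```

If the first line printed is something like utf-8 rather than cp1252, print() on that stream would not hit this particular error.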
There is little you can do about this in the Windows console itself. I would advise you to try one of these workarounds:
- Do not print to the console but write to a (UTF-8) file. You will then be able to read it with a UTF-8 enabled text editor like Notepad++.
- Manually encode anything before printing it, with errors='ignore' or errors='replace'. That way, any offending characters are dropped or replaced and no error is raised:
print(soup.prettify().encode('cp1252', errors='ignore'))
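Both workarounds can be sketched as follows. The html string stands in for soup.prettify(), and the output path page.html in the temp directory is a hypothetical name chosen for illustration. Note that printing the encoded bytes directly shows a b'...' literal; decoding back to str, as done here, is one way to get ordinary text output instead.

```python
import os
import tempfile

# Stand-in for soup.prettify(); contains a ZERO WIDTH SPACE (U+200B).
html = "<p>zero\u200bwidth</p>"

# Workaround 1: write to a UTF-8 file instead of printing to the console.
out_path = os.path.join(tempfile.gettempdir(), "page.html")  # hypothetical path
with open(out_path, "w", encoding="utf-8") as f:
    f.write(html)

# Workaround 2: drop or replace whatever cp1252 cannot represent, then
# decode back to str so print() shows text rather than a bytes literal.
ignored = html.encode("cp1252", errors="ignore").decode("cp1252")
replaced = html.encode("cp1252", errors="replace").decode("cp1252")

print(ignored)   # the zero width space is dropped
print(replaced)  # the zero width space becomes '?'
```

The file route keeps the text intact, while the encode route loses the characters cp1252 cannot represent, so the first is the safer choice when the output is needed later.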
Answered By - Serge Ballesta