Issue
I am trying to decode a Brazilian Portuguese text:
'Demais Subfun\xc3\xa7\xc3\xb5es 12'
It should be
'Demais Subfunções 12'
>>> a.decode('unicode_escape')
>>> a.encode('unicode_escape')
>>> a.decode('ascii')
>>> a.encode('ascii')
All of these give:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13:
ordinal not in range(128)
On the other hand, this gives:
>>> print a.encode('utf-8')
Demais Subfun├â┬º├â┬Áes 12
>>> print a
Demais Subfunções 12
Solution
You have binary data that is not ASCII encoded. The \xhh escape sequences indicate that your data is encoded with a different codec. What you are seeing is the representation Python produces with the repr() function: a Python literal that you can re-use to re-create the exact same value. This representation is very useful when debugging a program.
In other words, the \xhh escape sequences represent individual bytes, and the hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.
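To see those four bytes for yourself, you can list every byte above 0x7F together with its offset. This is only a minimal sketch; it assumes the value from the question is written as a byte-string literal, and the names data and non_ascii are purely illustrative:

```python
# The value from the question, written as a byte-string literal
# (works on Python 2.6+ and Python 3).
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# bytearray yields integers on both Python 2 and Python 3, so the
# comparison against 0x7F (the last ASCII value) behaves the same.
non_ascii = [(offset, '%02X' % byte)
             for offset, byte in enumerate(bytearray(data))
             if byte > 0x7F]

print(non_ascii)  # [(13, 'C3'), (14, 'A7'), (15, 'C3'), (16, 'B5')]
```

Offset 13 is the same position that the UnicodeDecodeError in the question complains about.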
What you have is UTF-8 data, so decode it as such:
>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12
The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
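If you want to check that mapping rather than take it on trust, the standard unicodedata module can name each decoded character. A small sketch, assuming the same byte string as above (data, text and names are illustrative names, not anything from the question):

```python
import unicodedata

data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Decoding turns each two-byte UTF-8 sequence into a single code point.
text = data.decode('utf-8')

# Collect the code point and official Unicode name of every non-ASCII character.
names = [(hex(ord(ch)), unicodedata.name(ch)) for ch in text if ord(ch) > 0x7F]

print(len(data), len(text))  # 22 bytes become 20 characters
print(names)
```

On Python 3 the second line printed should list 0xe7 as LATIN SMALL LETTER C WITH CEDILLA and 0xf5 as LATIN SMALL LETTER O WITH TILDE.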
ASCII happens to be a subset of UTF-8, which is why all the other letters can be shown as plain ASCII characters in the Python repr() output.
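The subset relationship can be illustrated roughly like this: the pure-ASCII part of the value decodes identically under both codecs, while the full value is only valid as UTF-8. This is a sketch under the assumption that the first 13 bytes ('Demais Subfun') are the ASCII-only prefix; the variable names are illustrative:

```python
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# The first 13 bytes are plain ASCII, so both codecs agree on them.
prefix = data[:13]
same = prefix.decode('ascii') == prefix.decode('utf-8')
print(same)  # True

# The full value is valid UTF-8 but not valid ASCII: the ASCII codec
# stops at the first byte above 0x7F and reports its offset.
try:
    data.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start
print(failed_at)  # 13
```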
Answered By - Martijn Pieters