Monday, October 4, 2021

[FIXED] Why does Python 2 think these bytes are the microphone emoji but Python 3 doesn't?


Issue

I have some data in a database which was inputted by a user as "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, it displays the emojis correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding these as UTF-8 outputs this unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on my Mac displays the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

However in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, it seems to take issue with the last 6 bytes (the microphone emoji):

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Furthermore, other tools, like this online hex to Unicode converter, tell me these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2 and whatever program encoded the user's input think these bytes are the microphone emoji, but Python 3 and other tools do not?

Solution

A couple of web pages help answer your question:

https://bugs.python.org/issue9133 (Relates to Python 2's overly permissive UTF-8 handling)
How to work with surrogate pairs in Python? (Relates to dealing with that permissiveness)

If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, that is:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')

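then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is a surrogate pair that's supposed to stand in for the microphone emoji. In other words, whatever encoded the user's input wrote each half of the UTF-16 surrogate pair as its own 3-byte sequence, instead of writing the single 4-byte UTF-8 sequence for U+1F3A4. Strict UTF-8 does not allow surrogate code points (U+D800 through U+DFFF) to be encoded at all, so Python 3 rejects those bytes by default, while Python 2's more permissive decoder accepts them.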
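To make the surrogate-pair relationship concrete, here is a small Python 3 sketch that decodes each 3-byte group separately and combines the two halves with the standard UTF-16 surrogate formula. The variable names are only illustrative; the byte values are the ones from the question.

```python
# Decode each 3-byte group on its own; 'surrogatepass' lets lone
# surrogate code points through instead of raising UnicodeDecodeError.
high = b'\xed\xa0\xbc'.decode('utf-8', errors='surrogatepass')
low = b'\xed\xbe\xa4'.decode('utf-8', errors='surrogatepass')

high_cp = ord(high)  # 0xD83C, a high (lead) surrogate
low_cp = ord(low)    # 0xDFA4, a low (trail) surrogate

# Standard UTF-16 formula for combining a surrogate pair into one code point.
combined = 0x10000 + ((high_cp - 0xD800) << 10) + (low_cp - 0xDC00)
print(hex(combined))  # 0x1f3a4, the microphone emoji

# A strict UTF-8 encoder writes that code point as a single 4-byte sequence.
proper_utf8 = chr(combined).encode('utf-8')
print(proper_utf8)  # b'\xf0\x9f\x8e\xa4'
```

So the six bytes in the database and the four bytes b'\xf0\x9f\x8e\xa4' describe the same character, but only the four-byte form is valid UTF-8.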

You can get back to the microphone in Python 3 by encoding the string that contains the surrogate pair as UTF-16 with the "surrogatepass" error handler, and then decoding the result as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
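If many database rows need the same treatment, the round trip above could be wrapped in a helper along these lines. This is only a sketch: the function name repair_surrogate_utf8 is made up for illustration, and it assumes every surrogate in the input is properly paired (an unpaired surrogate would make the final UTF-16 decode raise UnicodeDecodeError).

```python
def repair_surrogate_utf8(raw: bytes) -> str:
    """Decode UTF-8-like bytes in which surrogates were encoded individually.

    Hypothetical helper, not part of any library. Assumes all surrogates
    in the input come in valid high/low pairs.
    """
    # Step 1: let the individually encoded surrogates through.
    text = raw.decode('utf-8', errors='surrogatepass')
    # Step 2: round-trip through UTF-16 so each pair merges into one code point.
    return text.encode('utf-16', errors='surrogatepass').decode('utf-16')


data = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
fixed = repair_surrogate_utf8(data)
print(ascii(fixed))  # 'BTS\u26be\ufe0f>BTS\U0001f3a4'

# Re-encoding now yields valid UTF-8 that strict decoders accept.
clean = fixed.encode('utf-8')
print(clean)  # b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
```

Writing the cleaned bytes back to the database would let other tools read the data without a special error handler.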

Answered By - jjramsey

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
