Monday, January 1, 2024

[FIXED] What's the difference between unicode(self) and self.__unicode__() in a Python class?


Issue

While handling a Unicode problem, I found that unicode(self) and self.__unicode__() behave differently:

#-*- coding:utf-8 -*-
import sys
import dis
class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return self.__unicode__()
dis.dis(test)  # dis.dis prints the disassembly itself and returns None
a = test()
print a

The above code works fine, but if I change self.__unicode__() to unicode(self), it raises an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

The code with the problem is:

#-*- coding:utf-8 -*-
import sys
import dis
class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return unicode(self)
dis.dis(test)  # dis.dis prints the disassembly itself and returns None
a = test()
print a

I'm very curious about how Python handles this. I tried the dis module but didn't see much difference:

Disassembly of __str__:
 12           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (__unicode__)
              6 CALL_FUNCTION            0
              9 RETURN_VALUE

Disassembly of __str__:
 10           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (self)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE

Solution

s = u'中文'
return s.encode('utf-8')

This returns a byte string, not a Unicode string. That is what encode does. utf-8 is not a thing that magically turns data into Unicode; if anything, it's the opposite - a way of representing Unicode (an abstraction) in bytes (data, more or less).

We need a bit of terminology here. To encode is to take a Unicode string and make a byte string that represents it, using some kind of encoding. To decode is the reverse: to take a byte string (that we think encodes a Unicode string) and interpret it as a Unicode string, using a specified encoding.

When we encode to a byte string and then decode using the same encoding, we get the original Unicode back.
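
As a minimal sketch of that round trip (the variable names are mine, and the string from the question is written with \u escapes so the snippet needs no source-encoding declaration and runs on both Python 2 and 3):

```python
text = u'\u4e2d\u6587'            # the two-character Unicode string from the question
data = text.encode('utf-8')       # encode: Unicode string -> byte string
restored = data.decode('utf-8')   # decode: byte string -> Unicode string

print(len(data))                  # number of bytes in the utf-8 form
print(restored == text)           # the round trip gives the original back
```

The byte string is longer than the Unicode string because utf-8 uses three bytes for each of these characters.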

utf-8 is one possible encoding. There are many, many more.
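
To make that concrete, here is a small sketch that encodes the same string with three codecs (the selection of utf-8, utf-16-le and gbk is just an illustration, not something from the question). Each one round-trips with itself, but decoding with a different codec than the one used to encode does not give the original back:

```python
text = u'\u4e2d\u6587'  # the two-character string from the question

sizes = {}
for codec in ('utf-8', 'utf-16-le', 'gbk'):
    data = text.encode(codec)
    sizes[codec] = len(data)
    # Decoding with the same codec always restores the original.
    assert data.decode(codec) == text
    print('%s: %d bytes' % (codec, len(data)))

# Decoding utf-8 bytes as latin-1 raises no error, but yields different text.
mismatched = text.encode('utf-8').decode('latin-1')
print(mismatched == text)
```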

Sometimes Python 2 will report a UnicodeDecodeError when you call encode. Why? Because you tried to encode a byte string. The proper input for encoding is a Unicode string, so Python "helpfully" tries to decode the byte string to Unicode first. But it doesn't know which codec to use, so it falls back to its default, ascii. This codec is the safest choice in an environment where you could receive all kinds of data: it simply reports an error for any byte >= 128, since those bytes are handled in a gazillion different ways by the various 8-bit encodings. (Remember trying to import a Word file with letters like é from a Mac to a PC or vice-versa, way back in the day? You'd get some other weird symbol on the other computer, because the platform built-in encoding was different.)
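
The hidden step can be spelled out by hand. In Python 2, calling encode on a byte string performs an ascii decode implicitly; the sketch below does that decode explicitly instead, so it behaves the same on Python 2 and 3 (the variable names are mine):

```python
data = u'\u4e2d\u6587'.encode('utf-8')  # a byte string whose first byte is 0xe4

error = None
try:
    # This is the decode Python 2 attempts behind the scenes.
    data.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The message printed is the same one the question reports, because the first byte, 0xe4, is outside the ascii range.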

Making things even more complicated, in Python 2 the encode/decode mechanism is also used to implement some other neat things that have nothing to do with interpreting Unicode. For example, there is a Base64 encoder, and a thing that automatically handles string escape sequences (i.e. it will change a backslash, followed by a letter 't', into a tab). Some of these do "encode" or "decode" from a byte string to a byte string, or from Unicode to Unicode.
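
A rough illustration of those non-Unicode codecs, using the codecs module functions so the same lines run on Python 2 and 3 (in Python 2 you would more commonly see the method form, such as 'hello'.encode('base64')). Note that unicode_escape stands in here for the escape-handling codec, which is an assumption on my part about which one is meant:

```python
import codecs

# Base64: a byte string in, a byte string out. No Unicode involved.
b64 = codecs.encode(b'hello', 'base64')
print(repr(b64))

# Escape handling: the two characters backslash and 't' become one tab.
unescaped = codecs.decode(b'a\\tb', 'unicode_escape')
print(len(unescaped))
```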

(By the way, this all works completely differently - much more clearly, IMHO - in Python 3.)

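Similarly, when __unicode__ returns a byte string (which it should not, as a matter of style), the Python unicode() built-in function automatically decodes it as ascii; and when __str__ returns a Unicode string (which again it should not), str() will encode it as ascii. This happens behind the scenes, in code you cannot control. That is the difference in the question: self.__unicode__() is an ordinary method call that hands back the byte string untouched, whereas unicode(self) passes the result through that implicit ascii decode, which fails on the byte 0xe4. However, you can fix __unicode__ and __str__ to do what they are supposed to do.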

(You can, in fact, override the encoding for unicode, by passing a second parameter. However, this is the wrong solution here since you should already have a Unicode string returned from __unicode__. And str doesn't take an encoding parameter, so you're out of luck there.)

So, now we can solve the problem.

Problem: We want __unicode__ to return the Unicode string u'中文', and we want __str__ to return the utf-8-encoded version of that.

Solution: return that string directly in __unicode__, and do the encoding explicitly in __str__:

class test():
    def __unicode__(self):
        return u'中文'

    def __str__(self):
        return unicode(self).encode('utf-8')

Answered By - Karl Knechtel

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
