Issue
While handling a Unicode problem, I found that unicode(self) and self.__unicode__() behave differently:
# -*- coding: utf-8 -*-
import sys
import dis

class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return self.__unicode__()

print dis.dis(test)
a = test()
print a
The above code works, but if I change self.__unicode__() to unicode(self), it raises an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The failing code is:
# -*- coding: utf-8 -*-
import sys
import dis

class test():
    def __unicode__(self):
        s = u'中文'
        return s.encode('utf-8')

    def __str__(self):
        return unicode(self)

print dis.dis(test)
a = test()
print a
I was curious how Python handles this, so I tried the dis module, but the disassembly shows little difference:
Disassembly of __str__:
 12           0 LOAD_FAST                0 (self)
              3 LOAD_ATTR                0 (__unicode__)
              6 CALL_FUNCTION            0
              9 RETURN_VALUE

vs.

Disassembly of __str__:
 10           0 LOAD_GLOBAL              0 (unicode)
              3 LOAD_FAST                0 (self)
              6 CALL_FUNCTION            1
              9 RETURN_VALUE
Solution
s = u'中文'
return s.encode('utf-8')
This returns a non-Unicode byte string. That's what encode does. utf-8 is not a thing that magically turns data into Unicode; if anything, it's the opposite: a way of representing Unicode (an abstraction) in bytes (data, more or less).
We need a bit of terminology here. To encode is to take a Unicode string and make a byte string that represents it, using some kind of encoding. To decode is the reverse: taking a byte string (that we think encodes a Unicode string) and interpreting it as a Unicode string, using a specified encoding.
When we encode to a byte string and then decode using the same encoding, we get the original Unicode back. utf-8 is one possible encoding. There are many, many more.
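The round trip described above can be sketched as follows. The variable names are illustrative only, and utf-16-le is picked here just as an example of "another encoding"; the snippet is written so that it should run unchanged on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'中文'                       # a Unicode string (abstract characters)

data = text.encode('utf-8')         # encode: Unicode -> byte string
restored = data.decode('utf-8')     # decode: byte string -> Unicode

# Decoding with the encoding we encoded with gives back the original text.
assert restored == text

# The same text becomes different bytes under a different encoding.
data16 = text.encode('utf-16-le')

print(len(data))    # number of bytes in the utf-8 form
print(len(data16))  # number of bytes in the utf-16-le form
```

Note that the first utf-8 byte of this string is 0xe4, which is the byte named in the error message from the question.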
Sometimes Python will report a UnicodeDecodeError when you call encode. Why? Because you tried to encode a byte string. The proper input for this process is a Unicode string, so Python "helpfully" tries to decode the byte string to Unicode first. But it doesn't know what codec to use, so it assumes ascii. This codec is the safest choice in an environment where you could receive all kinds of data. It simply reports an error for bytes >= 128, which are handled in a gazillion different ways in various 8-bit encodings. (Remember trying to import a Word file with letters like é from a Mac to a PC or vice versa, way back in the day? You'd get some other weird symbol on the other computer, because the platform's built-in encoding was different.)
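To make that hidden step visible, here is a small sketch that performs the ascii decode explicitly, which is what Python 2 does implicitly when encode is called on a byte string. Doing it by hand is a choice made for this example so that it behaves the same way on Python 2 and Python 3 (where byte strings have no encode method at all).

```python
# -*- coding: utf-8 -*-
data = u'中文'.encode('utf-8')   # a byte string containing bytes >= 128

try:
    # Python 2 performs this step behind the scenes before
    # running data.encode(...); here it is spelled out.
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records which codec failed, where, and why.
print(error.encoding)
print(error.start)
print(error.reason)
```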
Making things even more complicated, in Python 2 the encode/decode mechanism is also used to implement some other neat things that have nothing to do with interpreting Unicode. For example, there is a Base64 encoder, and a thing that automatically handles string escape sequences (i.e. it will change a backslash, followed by a letter 't', into a tab). Some of these do "encode" or "decode" from a byte string to a byte string, or from Unicode to Unicode.
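A rough illustration of those non-text codecs follows. In Python 2 you could write 'hello'.encode('base64') directly; this sketch goes through the codecs module instead, and uses the unicode_escape codec rather than Python 2's string_escape. Both substitutions are assumptions made so that the example should also run on Python 3.

```python
import codecs

# Base64: a byte string "encoded" into another byte string.
packed = codecs.encode(b'hello', 'base64')
unpacked = codecs.decode(packed, 'base64')

# Escape handling: a backslash followed by 't' is "decoded" into a tab.
raw = b'a\\tb'                               # four bytes: a, backslash, t, b
cooked = codecs.decode(raw, 'unicode_escape')

print(repr(packed))
print(repr(cooked))
```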
(By the way, this all works completely differently - much more clearly, IMHO - in Python 3.)
Similarly, when __unicode__ returns a byte string (which it should not, as a matter of style), the Python unicode() built-in function automatically decodes it as ascii; and when __str__ returns a Unicode string (which again it should not), str() will encode it as ascii. That is the difference in the question: calling self.__unicode__() directly just hands back the byte string, while unicode(self) runs it through the implicit ascii decode, which fails on byte 0xe4. This happens behind the scenes, in code you cannot control. However, you can fix __unicode__ and __str__ to do what they are supposed to do.
(You can, in fact, override the encoding for unicode by passing a second parameter. However, this is the wrong solution here, since you should already have a Unicode string returned from __unicode__. And str doesn't take an encoding parameter, so you're out of luck there.)
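The difference between calling the method directly and going through the built-in can be imitated in a few lines. The function emulate_unicode below is a hypothetical stand-in written for this example, not Python's actual implementation; it only mimics the behaviour described above (decode a byte-string result as ascii), and it is written this way so the sketch should also run on Python 3, where the unicode built-in no longer exists.

```python
# -*- coding: utf-8 -*-
class Test(object):
    def __unicode__(self):
        # Deliberately wrong, as in the question: returns bytes, not Unicode.
        return u'中文'.encode('utf-8')


def emulate_unicode(obj):
    """Hypothetical imitation of what Python 2's unicode(obj) does."""
    result = obj.__unicode__()
    if isinstance(result, bytes):
        # The implicit step: a byte-string result is decoded as ascii.
        result = result.decode('ascii')
    return result


a = Test()

# A plain method call hands the byte string back untouched.
direct = a.__unicode__()

# Going through the built-in (imitated here) triggers the ascii decode.
try:
    emulate_unicode(a)
    failed = False
except UnicodeDecodeError:
    failed = True

print(failed)
```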
So, now we can solve the problem.
Problem: We want __unicode__ to return the Unicode string u'中文', and we want __str__ to return the utf-8-encoded version of that.
Solution: return that string directly in __unicode__, and do the encoding explicitly in __str__:
class test():
    def __unicode__(self):
        return u'中文'

    def __str__(self):
        return unicode(self).encode('utf-8')
Answered By - Karl Knechtel