Issue
In my code, I encode a string with UTF-8. I take the output, convert it to a string, and send it to my other program. The other program receives this string, but when I try to decode it, I get an error: AttributeError: 'str' object has no attribute 'decode'. I need to send the encoded data as a string because my other program receives it as JSON. My first program is in Python 3, and the other program is in Python 2.
# my first program
x = u"宇宙"
x = str(x.encode('utf-8'))
# my other program
text = x.decode('utf-8')
print(text)
What should I do to convert the string received by the second program to bytes so the decode works?
Solution
The key to answering this properly is how you pass these objects to the Python 2 program: you are using JSON.
So, stay with me:
After the .encode step in program 1, you have a bytes object. Calling str(...) on it does not decode anything: it just puts an escaping layer on this bytes object and turns it back into a string. When this string is written as is to a file, or transmitted over the network, it will be encoded again. Any non-ASCII characters are usually escaped with the \u prefix and the codepoint for each character, but the original Chinese characters themselves are by now encoded in UTF-8 and doubly escaped.
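To make the str(...) step concrete, here is a minimal Python 3 sketch of what each call returns. The variable names are only illustrative and are not taken from the question.

```python
original = "宇宙"

# encode() turns the text into a bytes object: two characters, six bytes.
encoded = original.encode("utf-8")
print(type(encoded).__name__, len(encoded))   # bytes 6

# str() on bytes (with no encoding argument) does NOT decode them.
# It returns the printable representation, the same text repr() gives.
as_text = str(encoded)
print(type(as_text).__name__, len(as_text))   # str 27
print(as_text == repr(encoded))               # True

# The real inverse of encode() is decode(), called on the bytes object.
round_trip = encoded.decode("utf-8")
print(round_trip == original)                 # True
```

The 27-character string starts with the literal characters b and ' and contains real backslashes, which is why nothing downstream can treat it as UTF-8 data.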
Python's JSON load functions already decode the contents of JSON data into text strings, so no separate decode step is needed at all.
In short: to pass data around, simply encode your original text as JSON in the first program, and do not bother with any decoding after json.loads in the target Python 2 program:
# my first program
import json
x = "宇宙"
# No str-encode-decode dance needed here.
...
data = json.dumps({"example_key": x})  # plus any other keys you need
# code to transmit json string by network or file as it is...
# my other program
import json
text = json.loads(data)["example_key"]
# text is a Unicode text string ready to be used!
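The same round trip can be checked in a single Python 3 process. This is only a sketch: build_payload and read_payload are hypothetical names standing in for the two programs, and the transport between them is left out.

```python
import json


def build_payload(text):
    """Stand-in for the first program: serialize plain text as JSON."""
    # With the default ensure_ascii=True, non-ASCII characters are written
    # as \uXXXX escapes, so the payload itself is pure ASCII.
    return json.dumps({"example_key": text})


def read_payload(payload):
    """Stand-in for the second program: parse JSON and read the value."""
    return json.loads(payload)["example_key"]


payload = build_payload("宇宙")
print(payload)                 # {"example_key": "\u5b87\u5b99"}

received = read_payload(payload)
print(received == "宇宙")      # True
```

On the Python 2 side, json.loads returns the value as a unicode object, so it is already text and calling .decode on it is unnecessary.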
The way you are doing it now, you are probably getting the text doubly encoded. I will mimic it on the Python 3 console, printing the result of each step so you can understand the transformations that are taking place.
In [1]: import json
In [2]: x = "宇宙"
In [3]: print(x.encode("utf-8"))
b'\xe5\xae\x87\xe5\xae\x99'
In [4]: text = str(x.encode("utf-8"))
In [5]: print(text)
b'\xe5\xae\x87\xe5\xae\x99'
In [6]: json_data = json.dumps(text)
In [7]: print(json_data)
"b'\\xe5\\xae\\x87\\xe5\\xae\\x99'"
# as you can see, it is doubly escaped, and it is mostly useless in this form
In [8]: recovered_from_json = json.loads(json_data)
In [9]: print(recovered_from_json)
b'\xe5\xae\x87\xe5\xae\x99'
In [10]: print(repr(recovered_from_json))
"b'\\xe5\\xae\\x87\\xe5\\xae\\x99'"
In [11]: # and if you have data like this in files/databases you need to recover:
In [12]: import ast
In [13]: recovered_text = ast.literal_eval(recovered_from_json).decode("utf-8")
In [14]: print(recovered_text)
宇宙
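If many stored values need this recovery, the ast.literal_eval step can be wrapped in a small function. The sketch below is one possible way to do it in Python 3: recover_text is a hypothetical name, and it assumes that any string which does not look like a bytes literal, or does not decode as UTF-8, should be returned unchanged.

```python
import ast


def recover_text(value):
    """Undo an accidental str(text.encode('utf-8')) on a string value.

    Hypothetical helper: anything that is not a valid bytes literal
    holding UTF-8 data is returned unchanged.
    """
    # Only strings shaped like b'...' or b"..." are candidates.
    if not (value.startswith("b'") or value.startswith('b"')):
        return value
    try:
        # literal_eval parses Python literals only; it does not run code.
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value
    if not isinstance(parsed, bytes):
        return value
    try:
        return parsed.decode("utf-8")
    except UnicodeDecodeError:
        return value


print(recover_text("b'\\xe5\\xae\\x87\\xe5\\xae\\x99'"))  # 宇宙
print(recover_text("already fine"))                        # already fine
```

Returning the input unchanged on failure keeps the helper safe to run over mixed data, at the cost of silently skipping values that are damaged in some other way.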
Answered By - jsbueno