Issue
I'm trying to learn how to work with Unicode in Python.
Let's say I have a file test containing Unicode characters:
áéíóúabcdefgçë
I want to make a python script that prints out all the unique characters in the file. This is what I have:
#!/usr/bin/python
import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
    with open(filename, "r") as fp:
        string = fp.read().replace('\n', '')
    chars = set()
    for char in string:
        chars.add(char)
    for char in chars:
        sys.stdout.write(char)
    print("")

if __name__ == "__main__":
    main()
This doesn't print the Unicode characters properly:
$ ./unicode.py test
▒a▒bedgf▒▒▒▒c▒▒
What is the correct way to do this, so that the characters print properly?
Solution
Your data is encoded, most likely as UTF-8. UTF-8 uses more than one byte to encode non-ASCII characters such as áéíóú. In Python 2, iterating over a byte string encoded as UTF-8 yields the individual bytes that make up the string, rather than the characters you are expecting.
>>> s = 'áéíóúabcdefgçë'
# There are 14 characters in s, but it contains 21 bytes
>>> len(s)
21
>>> s
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xbaabcdefg\xc3\xa7\xc3\xab'
# The first "character" (actually, byte) is unprintable.
>>> print s[0]
�
# So is the second.
>>> print s[1]
�
# But together they make up a character.
>>> print s[0:2]
á
So printing individual bytes doesn't work as expected.
>>> for c in s:print c,
...
� � � � � � � � � � a b c d e f g � � � �
But decoding the string to unicode and then printing does work.
>>> for c in s.decode('utf-8'):print c,
...
á é í ó ú a b c d e f g ç ë
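The same applies to the set of unique characters the script builds: a set of the decoded string has one entry per character, while a set of the raw string has one entry per byte. A quick sketch with the same s as above:
>>> len(set(s))                   # unique bytes
15
>>> len(set(s.decode('utf-8')))   # unique characters
14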
To make your code work as you expect, you need to decode the string you read from the file. Change
string = fp.read().replace('\n', '')
to
string = fp.read().replace('\n', '').decode('utf-8')
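For reference, here is a sketch of the whole script with only that line changed (still Python 2, and assuming the file really is UTF-8 encoded):
#!/usr/bin/python
import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
    with open(filename, "r") as fp:
        # Decode the UTF-8 bytes so that iteration yields characters, not bytes.
        string = fp.read().replace('\n', '').decode('utf-8')
    chars = set()
    for char in string:
        chars.add(char)
    for char in chars:
        sys.stdout.write(char)
    print("")

if __name__ == "__main__":
    main()
Writing the decoded (unicode) characters with sys.stdout.write relies on Python 2 picking up your terminal's encoding; if you redirect the output to a file or pipe, you may need to encode each character explicitly, e.g. sys.stdout.write(char.encode('utf-8')).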
Answered By - snakecharmerb