Issue
I'm trying to learn how to work with Unicode in Python.
Let's say I have a file test containing Unicode characters:
áéíóúabcdefgçë
I want to make a python script that prints out all the unique characters in the file. This is what I have:
#!/usr/bin/python
import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
    with open(filename, "r") as fp:
        string = fp.read().replace('\n', '')
    chars = set()
    for char in string:
        chars.add(char)
    for char in chars:
        sys.stdout.write(char)
    print("")

if __name__ == "__main__":
    main()
This doesn't print the Unicode characters properly:
$ ./unicode.py test
▒a▒bedgf▒▒▒▒c▒▒
What is the correct way to do this, so that the characters print properly?
Solution
Your data is encoded, most likely as UTF-8. UTF-8 uses more than one byte to encode non-ASCII characters such as áéíóú. In Python 2, iterating over a byte string encoded as UTF-8 yields the individual bytes that make up the string, rather than the characters you are expecting.
>>> s = 'áéíóúabcdefgçë'
# There are 14 characters in s, but it contains 21 bytes
>>> len(s)
21
>>> s
'\xc3\xa1\xc3\xa9\xc3\xad\xc3\xb3\xc3\xbaabcdefg\xc3\xa7\xc3\xab'
# The first "character" (actually, byte) is unprintable.
>>> print s[0]
�
# So is the second.
>>> print s[1]
�
# But together they make up a character.
>>> print s[0:2]
á
So printing individual bytes doesn't work as expected.
>>> for c in s:print c,
...
� � � � � � � � � � a b c d e f g � � � �
But decoding the string to unicode and then printing does work.
>>> for c in s.decode('utf-8'):print c,
...
á é í ó ú a b c d e f g ç ë
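The same applies to the set of unique characters the script builds: a set of the decoded string has one entry per character, while a set of the raw string has one entry per byte. A quick sketch with the same s as above:
>>> len(set(s))                   # unique bytes
15
>>> len(set(s.decode('utf-8')))   # unique characters
14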
To make your code work as you expect, you need to decode the string you read from the file. Change
string = fp.read().replace('\n', '')
to
string = fp.read().replace('\n', '').decode('utf-8')
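For reference, here is a sketch of the whole script with only that line changed (still Python 2, and assuming the file really is UTF-8 encoded):
#!/usr/bin/python
import sys

def main():
    if len(sys.argv) < 2:
        print("Argument required.")
        exit()
    else:
        filename = sys.argv[1]
    with open(filename, "r") as fp:
        # Decode the UTF-8 bytes so that iteration yields characters, not bytes.
        string = fp.read().replace('\n', '').decode('utf-8')
    chars = set()
    for char in string:
        chars.add(char)
    for char in chars:
        sys.stdout.write(char)
    print("")

if __name__ == "__main__":
    main()
Writing the decoded (unicode) characters with sys.stdout.write relies on Python 2 picking up your terminal's encoding; if you redirect the output to a file or pipe, you may need to encode each character explicitly, e.g. sys.stdout.write(char.encode('utf-8')).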
Answered By - snakecharmerb