Saturday, January 15, 2022

[FIXED] 'utf-8' codec can't decode byte reading a file in Python3.4 but not in Python2.7

python, python-3.x, utf-8

Issue

I was trying to read a file in Python 2.7, and it was read perfectly. The problem is that when I run the same program in Python 3.4, this error appears:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program in Windows (with python3.4), the error doesn't appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

and the code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
       if line == '\n': continue
       D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

Solution

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.

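In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects: bytes decoded according to some encoding. When no encoding is passed, open() uses the platform's preferred locale encoding (locale.getpreferredencoding(False)), which is usually utf-8 on Linux and macOS but often a code page such as cp1252 on Windows. That would explain why the same program raises no error on Windows.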
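To see which encoding open() falls back to on a particular machine, a minimal check like the following can help. It uses only the standard library; the variable name default_encoding is just illustrative.

```python
import locale
import sys

# The encoding open() uses in text mode when no encoding= argument is given.
default_encoding = locale.getpreferredencoding(False)

print("platform:", sys.platform)
print("default text encoding:", default_encoding)
```

Running this on both the Linux and the Windows machine should show whether the two defaults differ.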

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
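The same failure can be reproduced without any file. The sketch below assumes, purely for illustration, that the file was saved as latin-1 and contains the character 'ò' (which latin-1 stores as the single byte 0xf2). The sample text is made up; the real file's contents and encoding are not known.

```python
# Hypothetical sample; "\xf2" is the character 'ò'.
text = "Codi;Nom\n1;B\xf2ria\n"
raw = text.encode("latin-1")  # 'ò' becomes the single byte 0xf2

try:
    raw.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    # In utf-8, 0xf2 announces a multi-byte sequence, but the next byte
    # ('r') is not a continuation byte, so decoding stops here.
    failure = exc
    print(exc)

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("latin-1")
print(decoded)
```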

To fix the problem you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:
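Applied to the function from the question, the fix might look like the sketch below. The encoding parameter is an addition that is not in the original code, and latin-1 is only an assumed value for the demonstration; the actual encoding of the file has to be determined separately. Unlike the original, this version also strips the trailing newline from each line, so values taken from the last column do not end in '\n'.

```python
import os
import tempfile


def lectdict(filename, colkey, colvalue, encoding):
    """Map each value in column colkey to the list of values in column colvalue."""
    result = {}
    with open(filename, encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split(";")
            result.setdefault(fields[colkey], []).append(fields[colvalue])
    return result


# Demonstration with a made-up file written as latin-1 ("\xf2" is 'ò').
sample = "Codi;Codi_lloc_anonim;Nom\n1;A01;B\xf2ria\n\n2;A01;Nord\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="latin-1", newline="") as out:
        out.write(sample)
    traduccio = lectdict(path, 1, 2, encoding="latin-1")

print(traduccio)
```

Using a with block also closes the file automatically, so no explicit close call is needed.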

If you do not know the correct encoding, you could run this program to simply try all the encodings known to Python. If you are lucky there will be an encoding which turns the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

filename = '/tmp/test'
# use a separate name so the imported `encodings` module is not shadowed
candidates = all_encodings()
for enc in candidates:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
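When several encodings decode the file without error, it can help to compare how each one renders the bytes around the position named in the error message. The following is a rough sketch of that idea, not a definitive detector: preview_decodings is a hypothetical helper, and the sample bytes and candidate list are assumptions chosen for illustration.

```python
def preview_decodings(raw, position, candidates, width=10):
    """Decode the bytes around `position` with each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or None
    when decoding fails. The window is cut at arbitrary byte offsets, so
    multi-byte encodings may fail here even if they fit the whole file.
    """
    window = raw[max(0, position - width):position + width + 1]
    previews = {}
    for enc in candidates:
        try:
            previews[enc] = window.decode(enc)
        except (UnicodeDecodeError, LookupError):
            previews[enc] = None
    return previews


# Made-up bytes standing in for the file; with a real file, read it with
# open(filename, 'rb') and use the position from the error message.
raw = "2;A01;B\xf2ria;x\n".encode("latin-1")
position = raw.index(b"\xf2")

previews = preview_decodings(raw, position, ["utf-8", "latin-1", "cp1252", "cp437"])
for enc, text in previews.items():
    print(enc, repr(text))
```

Here latin-1 and cp1252 both yield a plausible word, while cp437 decodes without error but shows a different symbol in place of the accented letter, which is the kind of difference to look for.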

Answered By - unutbu

This answer, collected from Stack Overflow and tested by PythonFixing community admins, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
