Thursday, December 30, 2021

[FIXED] Python 2 to 3 Migration Process - Differences Regarding Unicode

Labels: python, python-2.x, python-3.x, unicode

Issue

I'm trying to migrate my code from Python 2 to Python 3, since Python 2 is no longer supported.
However, I'm having difficulties because of the differences between the two versions. I know that Python 2 has separate str and unicode types, while in Python 3 every str is a Unicode string.

Somewhere in my code, I store the hexdigest of a tuple's string representation in a database.
I get this tuple from a user-filled form, and one of its values is of type unicode.
Since Python 3 makes no distinction between str and unicode, the same values now produce a different hexdigest.

Here is a code snippet showing my issue:

Python 2:

In [1]: from hashlib import sha1

In [2]: cred = ('user', 'pass')

In [3]: sha1(str(cred)).hexdigest()
Out[3]: '7cd99ee437e8166559f55a0336d4b48d9bc62bb2'

In [4]: unicode_cred = ('user', u'pass')

In [5]: sha1(str(unicode_cred)).hexdigest()
Out[5]: '807a138ff9b0dd6ce6a937e3df3bba3223b40fcd'

Python 3:

In [1]: from hashlib import sha1                                                

In [2]: cred = ('user', 'pass')                                                 

In [3]: sha1(str(cred)).hexdigest()                                             
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-847e91fdf4c5> in <module>
----> 1 sha1(str(cred)).hexdigest()

TypeError: Unicode-objects must be encoded before hashing

In [4]: sha1(str(cred).encode('utf-8')).hexdigest()                             
Out[4]: '7cd99ee437e8166559f55a0336d4b48d9bc62bb2'

In [5]: unicode_cred = ('user', u'pass')                                        

In [6]: sha1(str(unicode_cred).encode('utf-8')).hexdigest()                     
Out[6]: '7cd99ee437e8166559f55a0336d4b48d9bc62bb2'

As you can see, in Python 2 Out[3] differs from Out[5], while in Python 3 Out[4] and Out[6] are the same.

Is there a way to reproduce the value of Out[5] from the Python 2 snippet in Python 3?
As part of the migration, I need to make sure the same input produces the same output, so that I don't insert a new record into my database instead of updating an existing one.

Solution

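Hashing the output of str() is the problem. The string that str() produces for a tuple is version-dependent: it is built from the repr() of each element, and on Python 2 the repr() of a unicode object carries a u prefix. To get the same hex digest you need exactly the same string: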

Python 2

>>> unicode_cred = ('user', u'pass')
>>> str(unicode_cred)
"('user', u'pass')"

Python 3 (note the missing u). The output of str() is itself a Unicode string on Python 3, so it must be encoded to bytes before it can be passed to sha1(). The b prefix in the output is not part of the string; it only marks the value as a bytes object.

>>> unicode_cred = ('user', u'pass')
>>> str(unicode_cred).encode('utf-8')
b"('user', 'pass')"
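As a small illustration of why the two tuples collapse into one string on Python 3, the sketch below rebuilds str() of a tuple from the repr() of its elements and checks that a u'' literal is an ordinary str. The variable name rebuilt is only for illustration, and the hand-rolled join covers tuples with two or more elements.

```python
cred = ('user', 'pass')
unicode_cred = ('user', u'pass')

# str() of a tuple is assembled from the repr() of each element.
rebuilt = "(" + ", ".join(repr(item) for item in cred) + ")"
print(rebuilt == str(cred))             # True

# On Python 3 the u prefix is still accepted in source code,
# but it produces a plain str, so repr() never prints it.
print(type(u'pass') is str)             # True
print(str(unicode_cred) == str(cred))   # True
```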

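To get the same digest you have to build the same string, including the u, which is a bit ugly. Here I use an f-string to format the tuple by hand with a u in front of the second element. I also encode with ascii, because non-ASCII characters create an additional issue: on Python 2, repr() escapes them inside a unicode object (u'é' is shown as u'\xe9'), while Python 3 prints them literally, so the two strings would differ again. Hopefully you don't have user names and passwords with non-ASCII characters.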

>>> from hashlib import sha1
>>> unicode_cred = ('user', u'pass')
>>> f"('{unicode_cred[0]}', u'{unicode_cred[1]}')"
"('user', u'pass')"
>>> sha1(f"('{unicode_cred[0]}', u'{unicode_cred[1]}')".encode('ascii')).hexdigest()
'807a138ff9b0dd6ce6a937e3df3bba3223b40fcd'
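The f-string above hard-codes a two-element tuple. A slightly more general sketch follows, assuming you know which positions held unicode objects on Python 2. The helper names py2_tuple_str and py2_style_digest and the unicode_flags parameter are hypothetical and not part of the original answer. The helper deliberately rejects anything outside printable ASCII without quotes or backslashes, since that is the subset where the Python 2 and Python 3 repr() agree apart from the prefix.

```python
from hashlib import sha1


def py2_tuple_str(values, unicode_flags):
    """Rebuild the text Python 2's str() would give for a tuple of strings.

    values        -- tuple of Python 3 str objects
    unicode_flags -- tuple of bools, True where Python 2 held a unicode object
    (Both parameter names are hypothetical, chosen for this sketch.)
    """
    if len(values) != len(unicode_flags):
        raise ValueError("values and unicode_flags must have the same length")
    parts = []
    for value, was_unicode in zip(values, unicode_flags):
        # Stay inside the subset where the Python 2 and Python 3 repr() agree
        # apart from the prefix: printable ASCII, no quotes, no backslashes.
        if (not value.isascii() or not value.isprintable()
                or any(ch in value for ch in "'\"\\")):
            raise ValueError("unsupported value: %r" % (value,))
        parts.append(("u" if was_unicode else "") + repr(value))
    body = ", ".join(parts)
    if len(parts) == 1:
        body += ","  # one-element tuples print with a trailing comma
    return "(" + body + ")"


def py2_style_digest(values, unicode_flags):
    """SHA-1 hex digest of the Python 2 style text of the tuple."""
    text = py2_tuple_str(values, unicode_flags)
    return sha1(text.encode("ascii")).hexdigest()


print(py2_tuple_str(("user", "pass"), (False, True)))  # ('user', u'pass')
print(py2_style_digest(("user", "pass"), (False, True)))
```

The second print hashes the same string as the f-string version, so it should give the same digest. Values that raise ValueError here need a case-by-case look at what Python 2 actually stored.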

Answered By - Mark Tolonen

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
