Issue
When I run script.py below in Spyder (IPython console), I get different values for b's hash between runs, whereas a's hash stays the same.
script.py
from sympy import Symbol
import numpy as np
x = Symbol("x")
a = np.array([1, x])
b = np.array([1.0, x])
print(hash(a.tobytes()))
print(hash(b.tobytes()))
Running this script yields the following output:
In [1]: runfile("./script.py")
-1258340495102975319
3795610135772286033
In [2]: runfile("./script.py")
-1258340495102975319
7432739601143179777
In [3]: runfile("./script.py")
-1258340495102975319
1451381667883822748
In [4]: runfile("./script.py")
-1258340495102975319
2683979045255549228
In [5]: runfile("./script.py")
-1258340495102975319
-345973347917904018
Please can someone shed some light on this strange behaviour, and possibly suggest a solution that gives consistent hashes for the floating-point case?
I've tried simulating the same behaviour inside a for loop, but in that case the hashes are consistent. The differences only occur when the script is run multiple times.
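A minimal version of that loop test (sketched here with a plain object-dtype array rather than SymPy, so it is self-contained):

```python
import numpy as np

# Within one interpreter session the array's buffer (and hence its
# hash) never changes, so every iteration of the loop agrees.
b = np.array([1.0], dtype=object)
hashes = {hash(b.tobytes()) for _ in range(100)}
print(len(hashes))  # 1 -> consistent within a single run
```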
I wouldn't expect the hashes of a and b to be the same, since 1 and 1.0 are different data types, but I would expect the hashes of both to be consistent over runs (as a's is).
The problem isn't specific to hashing: the tobytes() method gives different results on each run; the hash just gives a more obvious representation of the differences.
EDIT: After a little more testing, I've realised that the problem is not specific to SymPy; the same behaviour also happens with a NumPy array with an object data type. For instance, print(hash(np.array([1.0], dtype="O").tobytes())) gives different results over different runs.
EDIT2: There is still some unexplained behaviour given the answers about pointers, as this behaviour is specific to arrays with an object data type.
In [2]: hash(np.array([1.0]).tobytes())
Out[2]: -1405879698645296540
In [3]: hash(np.array([1.0]).tobytes())
Out[3]: -1405879698645296540
In [4]: hash(np.array([1.0]).tobytes())
Out[4]: -1405879698645296540
In [5]: hash(np.array([1.0]).tobytes())
Out[5]: -1405879698645296540
In [6]: hash(np.array([1.0], dtype="O").tobytes())
Out[6]: 7075328050134915067
In [7]: hash(np.array([1.0], dtype="O").tobytes())
Out[7]: -6443853770133964536
In [8]: hash(np.array([1.0], dtype="O").tobytes())
Out[8]: 889083274033361878
In [9]: hash(np.array([1.0], dtype="O").tobytes())
Out[9]: -6819397306369441685
Solution
Item 1: NumPy arrays can only hold numerical values directly. When you put a SymPy Symbol, or any other non-numerical object, into an array, even None, the dtype of the array will be object, and the values will be pointers.
Item 2: The hash of an array, or really any sane object, depends on the value. That means that two distinct arrays with the same dtype and numerical values should have the same hash.
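For a numeric dtype this is easy to confirm (a quick sketch):

```python
import numpy as np

# Two distinct arrays with the same dtype and values have identical
# raw buffers, so their hashes necessarily match.
a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])
assert a is not b                              # distinct objects
print(a.tobytes() == b.tobytes())              # True
print(hash(a.tobytes()) == hash(b.tobytes()))  # True
```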
Item 3: Unlike id(), which generally varies from run to run, hashes of numbers are computed consistently between runs. For example, the hash of a small int in CPython is the value of the int itself. The hash of a float is consistent between runs as well.
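This is easy to check for the numeric types (hashes of str and bytes, by contrast, are salted per session via PYTHONHASHSEED):

```python
# CPython derives the hash of an int or float purely from its value,
# with no per-run salt, so these results hold in every session.
print(hash(1) == 1)          # small ints hash to themselves: True
print(hash(1.0) == hash(1))  # equal numbers must hash equal: True
```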
You are creating a pair of arrays of pointers. The value of 1 in the array is a pointer to the interned int object created by the interpreter. x is a pointer too. What appears to be happening is that the elements of the first array are allocated in the same places between runs of the script, while the float is not. If you had arrays of ints or floats with a non-pointer dtype, the data would be hashed directly and all the hashes would be identical.
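The contrast is visible directly in the raw buffers (a sketch; the exact pointer bytes will differ on your machine and between runs):

```python
import numpy as np

# A numeric dtype serializes the IEEE-754 values themselves, while
# dtype=object serializes the addresses of the boxed Python floats.
num = np.array([1.0, 2.0])
obj = np.array([1.0, 2.0], dtype=object)
print(num.tobytes().hex())  # fixed: the bit patterns of 1.0 and 2.0
print(obj.tobytes().hex())  # pointers to boxed floats; varies by run
print(num.tobytes() == obj.tobytes())  # False: values vs. addresses
```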
To print the pointer values in the array, you can use a trick to bypass the guards NumPy puts in place: a.view(np.uint64) and a.astype(np.uint64) will not work because of these safeguards.
In [1]: a = np.empty(2, object); a[0] = 1; a[1] = None
In [2]: a.dtype.itemsize
Out[2]: 8
In [3]: x = np.ndarray(shape=2, dtype=np.uint64, buffer=a)
In [4]: x
Out[4]: array([11476160, 11272160], dtype=uint64)
By printing the raw array values (pointers), you should be able to confirm which objects are allocated in the same places as the prior run.
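As a cross-check (a sketch assuming 64-bit CPython, where id() returns the object's address), the same pointer words can be read back with struct and compared against the elements' ids:

```python
import struct
import numpy as np

# tobytes() on an object array copies the raw pointer words; in
# CPython, id() is the object's address, so the two should line up.
a = np.empty(2, object)
a[0] = 1
a[1] = None
ptrs = struct.unpack("2Q", a.tobytes())  # assumes 8-byte pointers
print(list(ptrs) == [id(a[0]), id(a[1])])  # True on 64-bit CPython
```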
Answered By - Mad Physicist