Issue
When I run script.py below in Spyder (IPython console), I get different values for b's hash between runs, whereas a's hash stays the same.
script.py
from sympy import Symbol
import numpy as np
x = Symbol("x")
a = np.array([1, x])
b = np.array([1.0, x])
print(hash(a.tobytes()))
print(hash(b.tobytes()))
Running this script yields the following output:
In [1]: runfile("./script.py")
-1258340495102975319
3795610135772286033
In [2]: runfile("./script.py")
-1258340495102975319
7432739601143179777
In [3]: runfile("./script.py")
-1258340495102975319
1451381667883822748
In [4]: runfile("./script.py")
-1258340495102975319
2683979045255549228
In [5]: runfile("./script.py")
-1258340495102975319
-345973347917904018
Please can someone shed some light on this strange behaviour, and possibly suggest a solution that gives consistent hashes for the floating-point case?
I've tried simulating the same behaviour inside a for loop, but in that case the hashes are consistent. The differences only occur when the script is run multiple times.
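A minimal version of that loop test (sketched here with a plain object-dtype array rather than SymPy, so it is self-contained):

```python
import numpy as np

# Within one interpreter session the array's buffer (and hence its
# hash) never changes, so every iteration of the loop agrees.
b = np.array([1.0], dtype=object)
hashes = {hash(b.tobytes()) for _ in range(100)}
print(len(hashes))  # 1 -> consistent within a single run
```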
I wouldn't expect the hashes of a and b to be the same, since 1 and 1.0 are different data types, but I would expect the hashes of both to be consistent over runs (as a's is).
The problem isn't specific to hashing: the tobytes() method gives different results on each run; the hash just gives a more obvious representation of the differences.
EDIT: After a little more testing, I've realised that the problem is not specific to SymPy; the same behaviour also happens with a NumPy array with an object data type. For instance, print(hash(np.array([1.0], dtype="O").tobytes())) gives different results over different runs.
EDIT2: There is still some unexplained behaviour given the answers about pointers, as this behaviour is specific to arrays with an object data type.
In [2]: hash(np.array([1.0]).tobytes())
Out[2]: -1405879698645296540
In [3]: hash(np.array([1.0]).tobytes())
Out[3]: -1405879698645296540
In [4]: hash(np.array([1.0]).tobytes())
Out[4]: -1405879698645296540
In [5]: hash(np.array([1.0]).tobytes())
Out[5]: -1405879698645296540
In [6]: hash(np.array([1.0], dtype="O").tobytes())
Out[6]: 7075328050134915067
In [7]: hash(np.array([1.0], dtype="O").tobytes())
Out[7]: -6443853770133964536
In [8]: hash(np.array([1.0], dtype="O").tobytes())
Out[8]: 889083274033361878
In [9]: hash(np.array([1.0], dtype="O").tobytes())
Out[9]: -6819397306369441685
Solution
Item 1: NumPy arrays can only hold numerical values directly. When you put a SymPy Symbol, or any other non-numerical object, into an array, even None, the dtype of the array will be object, and the values will be pointers.
Item 2: The hash of an array, or really any sane object, depends on the value. That means that two distinct arrays with the same dtype and numerical values should have the same hash.
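For a numeric dtype this is easy to confirm (a quick sketch):

```python
import numpy as np

# Two distinct arrays with the same dtype and values have identical
# raw buffers, so their hashes necessarily match.
a = np.array([1.0, 2.0])
b = np.array([1.0, 2.0])
assert a is not b                              # distinct objects
print(a.tobytes() == b.tobytes())              # True
print(hash(a.tobytes()) == hash(b.tobytes()))  # True
```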
Item 3: Unlike id(), which generally varies from run to run, hashes of numbers are computed consistently between runs. For example, the hash of a small int in CPython is the value of the int itself. The hash of a float is consistent between runs as well.
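This is easy to check for the numeric types (hashes of str and bytes, by contrast, are salted per session via PYTHONHASHSEED):

```python
# CPython derives the hash of an int or float purely from its value,
# with no per-run salt, so these results hold in every session.
print(hash(1) == 1)          # small ints hash to themselves: True
print(hash(1.0) == hash(1))  # equal numbers must hash equal: True
```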
You are creating a pair of arrays of pointers. The value of 1 in the array is a pointer to the interned int object created by the interpreter. x is a pointer too. What appears to be happening is that the elements of the first array are allocated in the same places between runs of the script, while the float is not. If you had arrays of ints or floats with a non-pointer dtype, the data would be hashed directly and all the hashes would be identical.
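The contrast is visible directly in the raw buffers (a sketch; the exact pointer bytes will differ on your machine and between runs):

```python
import numpy as np

# A numeric dtype serializes the IEEE-754 values themselves, while
# dtype=object serializes the addresses of the boxed Python floats.
num = np.array([1.0, 2.0])
obj = np.array([1.0, 2.0], dtype=object)
print(num.tobytes().hex())  # fixed: the bit patterns of 1.0 and 2.0
print(obj.tobytes().hex())  # pointers to boxed floats; varies by run
print(num.tobytes() == obj.tobytes())  # False: values vs. addresses
```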
To print the pointer values in the array, you can use a trick to bypass the guards NumPy puts in place: a.view(np.uint64) and a.astype(np.uint64) will not work because of these safeguards.
In [1]: a = np.empty(2, object); a[0] = 1; a[1] = None
In [2]: a.dtype.itemsize
Out[2]: 8
In [3]: x = np.ndarray(shape=2, dtype=np.uint64, buffer=a)
In [4]: x
Out[4]: array([11476160, 11272160], dtype=uint64)
By printing the raw array values (pointers), you should be able to confirm which objects are allocated in the same places as the prior run.
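As a cross-check (a sketch assuming 64-bit CPython, where id() returns the object's address), the same pointer words can be read back with struct and compared against the elements' ids:

```python
import struct
import numpy as np

# tobytes() on an object array copies the raw pointer words; in
# CPython, id() is the object's address, so the two should line up.
a = np.empty(2, object)
a[0] = 1
a[1] = None
ptrs = struct.unpack("2Q", a.tobytes())  # assumes 8-byte pointers
print(list(ptrs) == [id(a[0]), id(a[1])])  # True on 64-bit CPython
```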
Answered By - Mad Physicist