Issue
Tested with Python 3.7.9 (64-bit) on Windows and NumPy 1.19.5.
This is a pretty simple but confusing one.
Say I create a pretty large array of shape (10000, 16):
import time
import numpy as np
arr = np.random.random((10000, 16))
Now I want to take the dot product of each row with every other row. To do that, I matrix-multiply the array by its transpose; the result will have shape (10000, 10000). This is a pretty expensive operation and I don't expect it to be quick. Let's time it.
def measure(func):
    start = time.time()
    func()
    print(time.time() - start)
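(As an aside, a slightly more stable timer could use time.perf_counter and keep the best of several runs. This is a minimal sketch of my own; all of the numbers below come from the simple measure above.)
def measure_best(func, repeats=5):
    # time.perf_counter is monotonic and higher-resolution than time.time();
    # keeping the best of several runs reduces noise from other processes.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    print(best)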
>>> measure(lambda: arr @ arr.T)
0.875575065612793
No surprises here. As performant as NumPy is, it still takes almost a whole second to compute the result.
But what if...
>>> measure(lambda: arr * 1 @ arr.T)
0.4331023693084717
Somehow multiplying the matrix by 1 before performing the matrix multiplication has sped up the calculation.
From testing, this also holds if arr is of other data types.
>>> arr = arr.astype('float32')
>>> measure(lambda: arr @ arr.T)
0.6592690944671631
>>> measure(lambda: arr * 1 @ arr.T)
0.22941327095031738
We can see that they are indeed computing the same result (up to floating-point rounding).
>>> np.max(np.abs(arr @ arr.T - arr * 1 @ arr.T))
1.9073486e-06
Does multiplying the array by 1 (or any other scalar) give it some superpower? We can test it.
>>> arr_times_1 = arr * 1
>>> measure(lambda: arr_times_1 @ arr.T)
0.23055601119995117
Looks like it does. Does it somehow change the array? (The answer is no.)
>>> np.max(np.abs(arr - arr_times_1))
0.0
Can we "capture" this superpower?
>>> arr_copy_1 = arr_times_1.copy()
>>> arr_copy_2 = np.array(arr_times_1)
>>> measure(lambda: arr_copy_1 @ arr.T)
0.2252507209777832
>>> measure(lambda: arr_copy_2 @ arr.T)
0.22612690925598145
Seems like we can. So is something wrong with the array that np.random.random gave us?
>>> arr_copy_3 = np.array(arr)
>>> measure(lambda: arr_copy_3 @ arr.T)
0.2222919464111328
This result certainly supports that theory.
>>> arr_copy_4 = arr.copy()
>>> measure(lambda: arr_copy_4 @ arr.T)
0.23076415061950684
Even merely calling copy() on the original array seems to fix the issue. So what could be the issue?
>>> arr.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
>>> arr_times_1.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
Something wrong with the binary data?
>>> arr_bytes = arr.tobytes()
>>> arr_times_1_bytes = arr_times_1.tobytes()
>>> arr_bytes == arr_times_1_bytes
True
No differences.
Why?
Solution
It turns out the slowdown is caused by the two arrays being multiplied sharing the same memory; multiplying the array by 1 creates a separate array in memory, so the operands no longer alias each other.
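We can check this aliasing directly with np.shares_memory (a quick added sanity check; the booleans below are what the function reports for a view versus a freshly allocated array):
>>> np.shares_memory(arr, arr.T)
True
>>> np.shares_memory(arr * 1, arr.T)
False
arr.T is merely a view onto arr's buffer, whereas arr * 1 allocates new storage, so the two operands of the fast version no longer overlap in memory.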
If we do
>>> measure(lambda: arr_times_1 @ arr_times_1.T)
0.6356322765350342
We again observe the slow case, since the two operands once again share memory. The trick, therefore, is simply:
>>> measure(lambda: arr.copy() @ arr.T)
0.2263638973236084
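If the product is needed repeatedly, the copy only has to be made once up front. A minimal sketch (the helper name gram is illustrative, not part of the answer above):
def gram(a):
    # Copy once so the two operands of the matmul no longer share memory,
    # then compute the (n, n) matrix of row dot products.
    a_copy = a.copy()
    return a_copy @ a.T

g = gram(arr)  # same values as arr @ arr.T, without the aliasing slowdown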
Answered By - Luke