Issue
Tested with Python 3.7.9 (64-bit) on Windows and NumPy 1.19.5.
This is a pretty simple but confusing one.
Say I create a pretty large array of shape (10000, 16):
import time
import numpy as np
arr = np.random.random((10000, 16))
Now I want to take the dot product of each row with every other row. To do that, I matrix-multiply the array by its transpose; the result will have shape (10000, 10000). This is a pretty expensive operation and I don't expect it to be quick. Let's time it.
def measure(func):
    start = time.time()
    func()
    print(time.time() - start)
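(As an aside, a slightly more stable timer could use time.perf_counter and keep the best of several runs. This is a minimal sketch of my own; all of the numbers below come from the simple measure above.)
def measure_best(func, repeats=5):
    # time.perf_counter is monotonic and higher-resolution than time.time();
    # keeping the best of several runs reduces noise from other processes.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    print(best)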
>>> measure(lambda: arr @ arr.T)
0.875575065612793
No surprises here. As performant as NumPy is, it still takes almost a whole second to compute the result.
But what if...
>>> measure(lambda: arr * 1 @ arr.T)
0.4331023693084717
Somehow multiplying the matrix by 1 before performing the matrix multiplication has sped up the calculation.
From testing, this also holds if arr is of other data types.
>>> arr = arr.astype('float32')
>>> measure(lambda: arr @ arr.T)
0.6592690944671631
>>> measure(lambda: arr * 1 @ arr.T)
0.22941327095031738
We can see that they are indeed computing the same result (up to floating-point rounding).
>>> np.max(np.abs(arr @ arr.T - arr * 1 @ arr.T))
1.9073486e-06
Does multiplying the array by 1 (or any other scalar) give it some superpower? We can test it.
>>> arr_times_1 = arr * 1
>>> measure(lambda: arr_times_1 @ arr.T)
0.23055601119995117
Looks like it does. Does it somehow change the array? (The answer is no.)
>>> np.max(np.abs(arr - arr_times_1))
0.0
Can we "capture" this superpower?
>>> arr_copy_1 = arr_times_1.copy()
>>> arr_copy_2 = np.array(arr_times_1)
>>> measure(lambda: arr_copy_1 @ arr.T)
0.2252507209777832
>>> measure(lambda: arr_copy_2 @ arr.T)
0.22612690925598145
Seems like we can. So is something wrong with the array that np.random.random gave us?
>>> arr_copy_3 = np.array(arr)
>>> measure(lambda: arr_copy_3 @ arr.T)
0.2222919464111328
This result certainly supports that theory.
>>> arr_copy_4 = arr.copy()
>>> measure(lambda: arr_copy_4 @ arr.T)
0.23076415061950684
Even merely calling copy() on the original array seems to fix the issue. So what could be the issue?
>>> arr.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
>>> arr_times_1.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
Something wrong with the binary data?
>>> arr_bytes = arr.tobytes()
>>> arr_times_1_bytes = arr_times_1.tobytes()
>>> arr_bytes == arr_times_1_bytes
True
No differences.
Why?
Solution
It turns out the slowdown is caused by the two arrays being multiplied sharing the same memory; multiplying the array by 1 creates a separate array in memory, so the operands no longer alias each other.
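We can check this aliasing directly with np.shares_memory (a quick added sanity check; the booleans below are what the function reports for a view versus a freshly allocated array):
>>> np.shares_memory(arr, arr.T)
True
>>> np.shares_memory(arr * 1, arr.T)
False
arr.T is merely a view onto arr's buffer, whereas arr * 1 allocates new storage, so the two operands of the fast version no longer overlap in memory.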
If we do
>>> measure(lambda: arr_times_1 @ arr_times_1.T)
0.6356322765350342
We again observe the slow case, since the two operands once again share memory. The trick, therefore, is simply:
>>> measure(lambda: arr.copy() @ arr.T)
0.2263638973236084
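If the product is needed repeatedly, the copy only has to be made once up front. A minimal sketch (the helper name gram is illustrative, not part of the answer above):
def gram(a):
    # Copy once so the two operands of the matmul no longer share memory,
    # then compute the (n, n) matrix of row dot products.
    a_copy = a.copy()
    return a_copy @ a.T

g = gram(arr)  # same values as arr @ arr.T, without the aliasing slowdown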
Answered By - Luke