Issue
I have a numpy array A of shape (N, M, M) and another numpy array B of shape (N, M). I want to fill the diagonal of each of the N submatrices A[i, :, :] with the corresponding row of B. I want to do this as fast as possible using Numpy's vectorisation capabilities. Of course, one could use a for loop, but I want something more efficient.
import numpy as np

A = np.random.randn(1000, 50, 50)
B = np.random.randn(1000, 50)

for i in range(1000):
    np.fill_diagonal(A[i, :, :], B[i])
Benchmarking
Here I will benchmark the various solutions. From this question I have found method3, which is displayed below.
def method1(A, B):
    for i in range(A.shape[0]):
        np.fill_diagonal(A[i, :, :], B[i])
    return A

def method2(A, B):
    idx1, idx2 = np.indices(B.shape).reshape(2, -1)
    A[idx1, idx2, idx2] = B.ravel()
    return A

def method3(A, B):
    np.einsum('ijj->ij', A)[...] = B
    return A
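As a quick sanity check (not in the original post, just a minimal sketch), the three methods should agree; note that they all modify A in place, so each one gets its own copy here:

A = np.random.randn(1000, 50, 50)
B = np.random.randn(1000, 50)
r1 = method1(A.copy(), B)
r2 = method2(A.copy(), B)
r3 = method3(A.copy(), B)
assert np.array_equal(r1, r2) and np.array_equal(r2, r3)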
Running %timeit on these three methods gives:
- Method1: 951 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
- Method2: 289 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
- Method3: 47.6 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
So method 3 is the fastest, although I was hoping for a better speedup.
Solution
Note: this is a very long answer with "no improvement" remarks all the way through. But I end with a ×10 improvement (one that you likely wouldn't want to use, since it will probably slow down other parts of your code).
as_strided solution
Just for the sake of exhaustiveness, I can propose yet another version:
def method4(A, B):
    # A view of A that addresses only the diagonal elements: element [i, j] of the
    # view sits at byte offset i*strides[0] + j*(strides[1] + strides[2]), i.e. at A[i, j, j].
    view = np.lib.stride_tricks.as_strided(
        A, shape=B.shape, strides=(A.strides[0], A.strides[1] + A.strides[2]))
    view[:] = B
    return A
This is about as close as you can get to a C implementation of:
for(int i=0; i<1000; i++){
    for(int j=0; j<50; j++){
        A[i][j][j] = B[i][j];
    }
}
since that is almost what would occur inside numpy's code (a double for loop, with just a copy in it). To be more accurate, what would really be computed is:
int sta1=2500;  // stride of A's first axis, in elements: 50*50
int sta2=51;    // combined stride of A's last two axes: 50+1
int stb1=50;    // stride of B's first axis
int stb2=1;     // stride of B's second axis
for(int i=0; i<1000; i++){
    for(int j=0; j<50; j++){
        *(A + sta1*i + sta2*j) = *(B + stb1*i + stb2*j);
    }
}
But that is just the generic version of the same for loop (numpy's code has to take into account all the varieties of array shapes and strides it may have to copy).
So, I don't want to seem too arrogant here, but I don't think you can do a lot faster with pure numpy (you may accelerate things with numba or ctypes, by doing something less generic, for example with a hard-coded dtype, or even hard-coded strides and shape, depending on the context).
The timings of my method4 are just the same as those of your method3 (einsum), or of the other methods posted here. Which, again, is not surprising: they are just two ways that end up meaning the same thing in numpy code: a double iteration and a copy.
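For instance, a minimal check of both claims (equivalence and timing), reusing the arrays from the question:

import timeit
A = np.random.randn(1000, 50, 50)
B = np.random.randn(1000, 50)
assert np.array_equal(method4(A.copy(), B), method3(A.copy(), B))
# Both should land in the same ballpark, since both reduce to the same strided copy.
print(timeit.timeit(lambda: method3(A, B), number=1000))
print(timeit.timeit(lambda: method4(A, B), number=1000))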
Comparison with the simplest copy of the same amount of data
Another way to convince yourself that it is just that, and therefore that you can't really expect better (again, with pure numpy; with numba/ctypes, you may go faster), is to compare the timings with the ones you get from this function:
import timeit

def ref(A, B):
    A[:] = B
    return A

A = np.random.randn(1000, 2500)  # I really don't need it random, but let's stick as close as possible to the initial problem
B = np.random.randn(1000, 50)
t = timeit.timeit(lambda: ref(A[:, ::50], B), number=1000)
And you'll see that the timings are identical to the ones from your method3 or my method4. Again, it is just a copy of 50000 values.
The reason why I don't compare to just

A = np.random.randn(1000, 50)
B = np.random.randn(1000, 50)
A[:] = B

(which is also just a copy of 50000 values) is that in this latter example, all the values of A are contiguous in memory, which means the cache helps a lot. In my ref method, on the other hand, only one element out of every 50 is written, so the cache is used far less efficiently. And that "1 every 50" is very similar to what happens in methods 3 and 4 (except that there it is "1 every 51" 49 times out of 50, and 2 contiguous elements once every 50; but on average, that is also "1 every 50").
We often overlook the importance of the cache, but it changes things a lot.
On my computer, I get the following timings:
- 4.91 ms for method1
- 1.94 ms for method2
- 1.0144 ms for method3 (the "44" is meaningless, but I include it to make clear that these are 3 different timings)
- 1.0139 ms for method4
- 1.0172 ms for ref
But only 0.035 ms for ref without the ::50. So the cache effect is huge (I insist: it is the exact same code, copying the exact same number of elements; they are just not distributed the same way in memory).
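If you want to reproduce that comparison, here is a sketch of the measurement (exact numbers will of course differ per machine; the totals in the comments are per the figures above, for 1000 calls):

import timeit
B = np.random.randn(1000, 50)
A_strided = np.random.randn(1000, 2500)  # target where only 1 element in every 50 is written
A_contig = np.random.randn(1000, 50)     # contiguous target, same 50000 elements copied
print(timeit.timeit(lambda: ref(A_strided[:, ::50], B), number=1000))  # ~1 s total (~1 ms per call)
print(timeit.timeit(lambda: ref(A_contig, B), number=1000))            # ~0.035 s total (~0.035 ms per call)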
So, the point is:
- Comparison to a simple copy A[:, ::50] = B shows that the timings are similar to the ones you already have, so you can't expect to beat that.
- It was important to compare to A[:, ::50] = B rather than to A[:] = B, so as not to bias the comparison by executing the copy in an easier cache situation.
Numba
from numba import njit

@njit
def method5(A, B):
    n, m = B.shape
    for i in range(n):
        for j in range(m):
            A[i, j, j] = B[i, j]
    return A

A = np.random.randn(1000, 50, 50)  # back to the 3D array from the question
B = np.random.randn(1000, 50)
method5(A, B)  # call it once before benchmarking, to avoid counting compilation in the timing
This gives a very slightly better result. But really only slightly: the timings are still 1.01 ms for method4 and 1.00 ms for method3, but "only" 0.91 ms for method5 (numba). That is a bit better (not a great improvement, but consistent enough to know it is not luck), and another indication that we are very close to the fastest possible timings (without trying to rely on hardware tricks, like threads).
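(For completeness, the "threads" trick alluded to could look like the sketch below, using numba's prange to split the independent rows across threads. Whether it actually helps at this size is machine-dependent; thread start-up overhead can eat the gain.)

from numba import njit, prange

@njit(parallel=True)
def method5_threads(A, B):
    n, m = B.shape
    for i in prange(n):  # rows are independent, so they can be processed in parallel
        for j in range(m):
            A[i, j, j] = B[i, j]
    return A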
ctypes
Likewise, using C code (the Python wrapper first, then the C source compiled into libdiag.so):
import ctypes

cDoubleP = ctypes.POINTER(ctypes.c_double)
dll = ctypes.CDLL('./libdiag.so')
dll.method6.argtypes = (cDoubleP, cDoubleP, ctypes.c_int, ctypes.c_int)
dll.method6.restype = cDoubleP

def method6(A, B):
    n, m = B.shape
    dll.method6(A.ctypes.data_as(cDoubleP), B.ctypes.data_as(cDoubleP), n, m)
    return A
/* Compile with, e.g.: gcc -O3 -shared -fPIC -o libdiag.so diag.c */
double *method6(double *A, double *B, int n, int m){
    int sta1 = m*m;  // stride of A's first axis, in elements (2500 here)
    int sta2 = m+1;  // combined stride of A's last two axes (51 here)
    int stb = m;     // stride of B's first axis (50 here)
    for(int i=0; i<n; i++){
        for(int j=0; j<m; j++){
            A[sta1*i + sta2*j] = B[stb*i + j];
        }
    }
    return A;
}
But the timings are not better than numba's (which either says something about how efficient numba is at compiling Python, or about how suboptimal my C code is): 0.98 ms.
And yet, I am cheating a bit, since this makes the assumption that the B array is contiguous (which it is if it is generated as in your example, but wouldn't be if B were the result of fancy indexing, of a reshape, ...). I make the same assumption for A, but that one gains me nothing, since I could have passed the strides to the function and it would have been the same code. It is the implicit 1 as stb2 that is a bit of a cheat.
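If you cannot guarantee contiguity, one hedged workaround (my addition, not the original code) is to normalize B first; np.ascontiguousarray only copies when needed:

def method6_safe(A, B):
    B = np.ascontiguousarray(B)  # no-op if B is already C-contiguous, otherwise a copy
    n, m = B.shape
    dll.method6(A.ctypes.data_as(cDoubleP), B.ctypes.data_as(cDoubleP), n, m)
    return A
# A is still assumed C-contiguous: copying it would break the in-place semantics.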
A real improvement
At last, I finish with something other than enumerating solutions and telling you "it is not better".
The reason why all the other solutions, even though they are just copies of elements, can't do better is, as we have seen, the distribution of the elements in memory, which prevents us from taking advantage of the cache. We have already seen that the importance of the cache is really big (though that may be different on your machine, if your cache is already big enough to hold enough consecutive diagonal elements).
So, let's just ensure that our data are organized in memory in the same order in which the code uses them:
A = np.random.randn(50, 50, 1000).transpose(2, 0, 1)
B = np.random.randn(1000, 50)
So, it is the same 1000×50×50 array. But this time, the 1000 elements A[*, i, j] for a given i, j are consecutive in memory. And since that is the order in which the copy is made in the methods that do it that way, we get rid of the cache effect.
(I could have optimized even more by using a flat array, a bit bigger than 1000×50×50 elements, inside which, with some as_strided, I pick the values of A[i, j, k] in such an order that all the elements of all the diagonals are the 50000 first elements of the array. But it would probably not change much, since here we already benefit from the cache 49 times out of 50; the 50th alone would probably not change much.)
On my computer, the timing goes from around 1 ms to 0.09 ms. So, a more than ×10 gain.
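The measurement behind those numbers might look like this (a sketch; method4 works unchanged, because it reads the strides from A itself):

import timeit
A = np.random.randn(50, 50, 1000).transpose(2, 0, 1)  # same shape (1000, 50, 50), different memory order
B = np.random.randn(1000, 50)
# The diagonal elements of consecutive submatrices are now adjacent in memory,
# so the copy walks memory almost sequentially.
print(timeit.timeit(lambda: method4(A, B), number=1000))  # ~0.09 s total for 1000 calls, vs ~1 s with the original layout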
But (of course, it comes with a "but") you have to be aware that the speedup you get for this operation (touching all the elements of the diagonals) exists for the same reason that it would slow down some other operations, for example if you iterate over each matrix A[i, :, :] later. So, it really depends on what the rest of your code does. That ×10 gain may cost you more in the long run in some other parts of the code you haven't posted.
Answered By - chrslg