Issue
I need to compute the pairwise distances between two sets of vectors, but I only need to know a small proportion of those pairwise distances. This proportion will be less than 0.01 when my two sets of vectors are large enough.
This is my current solution with a small example. A and B are the two sets of vectors, and M stores the pairs of vectors I need to retain. The number of rows in both A and B will typically be much larger (several thousand). It might also be worth mentioning that M has at least one nonzero value in each row, but some columns may only contain zeros.
import numpy as np
A = np.array([[1, 2], [2, 3], [3, 4]]) # (3, 2)
B = np.array([[4, 5], [5, 6], [6, 7], [7, 8], [8, 9]]) # (5, 2)
M = np.array([[0, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 0, 0, 1]]) # (3, 5)
diff = A[:,None] - B[None,:] # (3, 5, 2)
distances = np.linalg.norm(diff, ord=2, axis=2) # (3, 5)
masked_distances = distances * M # (3, 5)
This code works, but it calculates a lot of norms that I don't need, so it's performing a lot of unnecessary computation, especially as A and B become larger. Is there a more efficient way to only compute the norms that I need? I'm open to suggestions for other packages.
I have thought about using sparse arrays to compute masked_distances, but I'm having problems with sparse arrays in scipy with more than 2 dimensions (i.e. diff).
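For context, scipy's sparse matrix classes are strictly two-dimensional, so a 3-D array like diff cannot be represented as one directly. A minimal sketch of the kind of failure this causes:
import numpy as np
import scipy.sparse
diff = np.zeros((3, 5, 2))  # same shape as A[:,None] - B[None,:] above
try:
    scipy.sparse.csr_matrix(diff)  # sparse matrix classes only accept 2-D data
except (TypeError, ValueError) as err:
    print(err)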
I've also tried creating a vectorised function. This also works, but it's even slower when A and B grow to have thousands of records.
def conditional_diff(a, b, m):
    if m:
        return a - b
    else:
        return 0

conditional_diff_vec = np.vectorize(conditional_diff, otypes=[float])
masked_diff = conditional_diff_vec(A[:,None], B[None,:], M[:,:,None])
masked_distances = np.linalg.norm(masked_diff, ord=2, axis=2)
Solution
I would solve this by using numba to construct a Compressed Sparse Row matrix. The CSR matrix allows me to avoid using memory to represent the many zeros in the matrix.
Numba allows me to use an explicit loop without losing too much performance. This allows me to use an if statement to avoid computing unused distances.
import numba as nb
import numpy as np
import scipy.sparse
import math

A_big = np.random.rand(2000, 10)
B_big = np.random.rand(4000, 10)
# Boolean mask with roughly 0.1% of entries set to True
M_big = np.random.rand(A_big.shape[0], B_big.shape[0]) < 0.001
@nb.njit()
def euclidean_distance(vec_a, vec_b):
    acc = 0.0
    for i in range(vec_a.shape[0]):
        acc += (vec_a[i] - vec_b[i]) ** 2
    return math.sqrt(acc)
@nb.njit()
def masked_distance_inner(data, indicies, indptr, matrix_a, matrix_b, mask):
    write_pos = 0
    N, M = matrix_a.shape[0], matrix_b.shape[0]
    for i in range(N):
        for j in range(M):
            if mask[i, j]:
                # Record distance value
                data[write_pos] = euclidean_distance(matrix_a[i], matrix_b[j])
                # Record what column this distance value should be in
                indicies[write_pos] = j
                write_pos += 1
        # Record how many entries are in this row
        indptr[i + 1] = write_pos
    # Assert that we wrote to every part of un-initialized memory
    assert write_pos == data.shape[0]
    assert write_pos == indicies.shape[0]
    # Data is returned by modifying parameters to function
def masked_distance(matrix_a, matrix_b, mask):
    N, M = matrix_a.shape[0], matrix_b.shape[0]
    assert mask.shape == (N, M)
    # Convert mask to bool
    mask = mask != 0
    # How many entries will this sparse array have?
    sparse_length = mask.sum()
    # Set up CSR matrix
    # data and indicies do not need to be zero-initialized,
    # and it is faster not to initialize them
    # Holds data values
    data = np.empty(sparse_length, dtype='float64')
    # Holds column that each data value is in
    indicies = np.empty(sparse_length, dtype='int64')
    # Holds the position of the elements of each row within data and indicies
    indptr = np.zeros(N + 1, dtype='int64')
    masked_distance_inner(data, indicies, indptr, matrix_a, matrix_b, mask)
    return scipy.sparse.csr_matrix((data, indicies, indptr), shape=(N, M))
%timeit masked_distance(A_big, B_big, M_big)
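To make the CSR layout concrete, here is what the three arrays work out to for the small A, B and M from the question (distances rounded to four decimal places):
# M has nonzero entries at (0, 3), (1, 0), (1, 1) and (2, 4), so:
#   data    ≈ [8.4853, 2.8284, 4.2426, 7.0711]  # one distance per kept pair
#   indicies = [3, 0, 1, 4]                     # column of each distance
#   indptr  = [0, 1, 3, 4]                      # row i owns data[indptr[i]:indptr[i+1]]
small_result = masked_distance(A, B, M)
print(small_result.toarray())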
A few notes:

- I found that calculating the Euclidean distance in numba was faster than np.linalg.norm(A[i] - B[i]), and still returned the same result. That's why the euclidean_distance() function is present.
- masked_distance() is responsible for setting up the algorithm; masked_distance_inner() is all of the speed-critical stuff.
- I benchmarked this versus the algorithm in the question, and found that for an input of A with shape (2000, 10) and B with shape (4000, 10), with 0.1% of M's elements set to True, it was roughly 40x faster:
  13.5 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)  -- this answer
  556 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -- the original approach
  The exact speedup will depend on the sizes of the matrices involved and how sparse the mask is. For example, I benchmarked the same problem with vectors of length 1000, and found a 1000x speedup.
- I checked the correctness of this algorithm against the original using np.allclose(); a sketch of that check is below.
- You could probably make this faster by using smaller dtypes. You could replace float64 with float32 if you only need 7 digits of accuracy for the distances, and int64 with int32 if you know that the dimensions of the matrix are smaller than 2^31 and that there will never be more than 2^31 defined elements. A sketch of that change is also below.
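For reference, the correctness check mentioned above can be written along these lines (a minimal sketch using the A_big, B_big and M_big arrays from above; it builds the full dense result, so it is only practical at moderate sizes):
# Dense reference, computed the same way as in the question
dense_distances = np.linalg.norm(A_big[:, None] - B_big[None, :], ord=2, axis=2) * M_big
# Sparse result from this answer, converted back to a dense array for comparison
sparse_distances = masked_distance(A_big, B_big, M_big).toarray()
assert np.allclose(dense_distances, sparse_distances)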
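The smaller-dtype variant only changes the three buffer allocations inside masked_distance(), roughly like this (assuming float32 precision is acceptable and that both the matrix dimensions and the number of defined elements stay below 2^31):
# Inside masked_distance(), the allocations would become:
data = np.empty(sparse_length, dtype='float32')    # distances at float32 precision
indicies = np.empty(sparse_length, dtype='int32')  # column indices
indptr = np.zeros(N + 1, dtype='int32')            # row boundaries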
Answered By - Nick ODell