Issue
Update: I made the solution into a library called close-numerical-matches.
I am looking for a way to find all close matches (within some tolerance) between two 2D arrays and get an array of the indices of the found matches. Multiple answers on SO show how to solve this problem for exact matches (typically with a dictionary), but that is not what I am looking for. Let me give an example:
>>> arr1 = [
[19.21, 19.19],
[13.18, 11.55],
[21.45, 5.83]
]
>>> arr2 = [
[13.11, 11.54],
[19.20, 19.19],
[51.21, 21.55],
[19.22, 19.18],
[11.21, 11.55]
]
>>> find_close_match_indices(arr1, arr2, tol=0.1)
[[0, 1], [0, 3], [1, 0]]
Above, [[0, 1], [0, 3], [1, 0]]
is returned because element 0 in arr1
, [19.21, 19.19] is within tolerance to elements 1 and 3 in arr2
. Order is not important to me, i.e. [[0, 3], [1, 0], [0, 1]]
would be just as acceptable.
The shape of arr1
is (n, 2) and arr2
is (m, 2). You can expect that n and m will be huge. Now, I can easily implement this using a nested for loop but I am sure there must be some smarter way than comparing every element against all other elements.
I thought about using k-means clustering to divide the problem into k buckets and thus make the nested for-loop approach more tractable, but I think there may be a small risk two close elements are just at the "border" of each of their clusters and therefore wouldn't get compared.
Any external dependencies such as Numpy, Scipy, etc. are fine and it is fine as well as to use O(n + m) space.
Solution
I got an idea for how to use buckets to solve this problem. The idea is that a key is formed based on the values of the elements and the tolerance level. To make sure potential matches that were in the "edge" of the bucket are compared against other element at "edges", all neighbour buckets are compared. Finally, I modified @Tim Roberts' approach for performing the actual matching slightly to match on both columns.
I made this into a library called close-numerical-matches. Sample usage:
>>> import numpy as np
>>> from close_numerical_matches import find_matches
>>> arr0 = np.array([[25, 24], [50, 50], [25, 26]])
>>> arr1 = np.array([[25, 23], [25, 25], [50.6, 50.6], [60, 60]])
>>> find_matches(arr0, arr1, tol=1.0001)
array([[0, 0], [0, 1], [1, 2], [2, 1]])
>>> find_matches(arr0, arr1, tol=0.9999)
array([[1, 2]])
>>> find_matches(arr0, arr1, tol=0.60001)
array([], dtype=int64)
>>> find_matches(arr0, arr1, tol=0.60001, dist='max')
array([[1, 2]])
>>> manhatten_dist = lambda arr: np.sum(np.abs(arr), axis=1)
>>> matches = find_matches(arr0, arr1, tol=0.11, dist=manhatten_dist)
>>> matches
array([[0, 1], [0, 1], [2, 1]])
>>> indices0, indices1 = matches.T
>>> arr0[indices0]
array([[25, 24], [25, 24], [25, 26]])
Some profiling:
from timeit import default_timer as timer
import numpy as np
from close_numerical_matches import naive_find_matches, find_matches
arr0 = np.random.rand(320_000, 2)
arr1 = np.random.rand(44_000, 2)
start = timer()
naive_find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start) # 255.335 s
start = timer()
find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start) # 5.821 s
Answered By - shmulvad
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.