Monday, January 17, 2022

[FIXED] Find all close numerical matches in two 2D arrays


Issue

Update: I made the solution into a library called close-numerical-matches.

I am looking for a way to find all close matches (within some tolerance) between two 2D arrays and get an array of the indices of the found matches. Multiple answers on SO show how to solve this problem for exact matches (typically with a dictionary), but that is not what I am looking for. Let me give an example:

>>> arr1 = [
    [19.21, 19.19],
    [13.18, 11.55],
    [21.45,  5.83]
]
>>> arr2 = [
    [13.11, 11.54],
    [19.20, 19.19],
    [51.21, 21.55],
    [19.22, 19.18],
    [11.21, 11.55]
]
>>> find_close_match_indices(arr1, arr2, tol=0.1)
[[0, 1], [0, 3], [1, 0]]

Above, [[0, 1], [0, 3], [1, 0]] is returned because element 0 in arr1, [19.21, 19.19], is within tolerance of elements 1 and 3 in arr2, and element 1 in arr1, [13.18, 11.55], is within tolerance of element 0 in arr2. Order is not important to me, i.e. [[0, 3], [1, 0], [0, 1]] would be just as acceptable.

The shape of arr1 is (n, 2) and the shape of arr2 is (m, 2). You can expect n and m to be huge. I can easily implement this with a nested for loop, but I am sure there must be a smarter way than comparing every element of one array against every element of the other.
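For reference, the nested-loop baseline could look like the minimal sketch below. The function name naive_close_match_indices is made up for illustration, and it assumes "within tolerance" means that every coordinate differs by at most tol, which the example above does not spell out. It does O(n * m) comparisons, which is exactly the cost I want to avoid.

```python
def naive_close_match_indices(arr1, arr2, tol):
    """Return [i, j] pairs where arr1[i] and arr2[j] are close.

    Assumption: "close" means every coordinate differs by at most tol.
    """
    matches = []
    for i, row1 in enumerate(arr1):
        for j, row2 in enumerate(arr2):
            # Compare the two rows coordinate by coordinate.
            if all(abs(a - b) <= tol for a, b in zip(row1, row2)):
                matches.append([i, j])
    return matches


arr1 = [[19.21, 19.19], [13.18, 11.55], [21.45, 5.83]]
arr2 = [[13.11, 11.54], [19.20, 19.19], [51.21, 21.55],
        [19.22, 19.18], [11.21, 11.55]]
print(naive_close_match_indices(arr1, arr2, tol=0.1))
# prints [[0, 1], [0, 3], [1, 0]]
```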

I thought about using k-means clustering to divide the problem into k buckets and thus make the nested for-loop approach more tractable, but I think there is a small risk that two close elements sit just at the "border" of their respective clusters and therefore would not get compared.

Any external dependencies such as NumPy, SciPy, etc. are fine, and it is also fine to use O(n + m) space.

Solution

I got an idea for how to use buckets to solve this problem. A key is formed for each element from its values and the tolerance level. To make sure that potential matches lying at the "edge" of a bucket are compared against elements at the "edges" of adjacent buckets, all neighbouring buckets are compared as well. Finally, I slightly modified @Tim Roberts' approach for performing the actual matching so that it matches on both columns.
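To make the bucket idea concrete, here is a rough pure-Python sketch. It is not the library's actual implementation: the function name bucket_close_match_indices, the choice of floor(value / tol) as the key, and the per-coordinate tolerance check are all assumptions made for illustration. Since two values that differ by at most tol land in the same or in adjacent cells of side length tol, checking the 3 x 3 block of neighbouring buckets should be enough to catch matches at the edges (ignoring floating-point corner cases right at the tolerance boundary).

```python
from collections import defaultdict
from itertools import product
from math import floor


def bucket_close_match_indices(arr1, arr2, tol):
    """Return [i, j] pairs where every coordinate of arr1[i] and arr2[j]
    differs by at most tol, comparing only neighbouring buckets."""

    def key(row):
        # Hypothetical bucket key: index of the grid cell (side length tol)
        # that contains the point.
        return tuple(floor(v / tol) for v in row)

    # Group the indices of arr2 by bucket key.
    buckets = defaultdict(list)
    for j, row in enumerate(arr2):
        buckets[key(row)].append(j)

    matches = []
    for i, row in enumerate(arr1):
        cell = key(row)
        # Visit the point's own bucket and every adjacent bucket.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            for j in buckets.get(neighbour, ()):
                # Exact check, since a neighbouring bucket can also hold
                # points that are up to 2 * tol away.
                if all(abs(a - b) <= tol for a, b in zip(row, arr2[j])):
                    matches.append([i, j])
    return matches
```

On the arrays from the question with tol=0.1, this should find the same three pairs, [0, 1], [0, 3] and [1, 0], possibly in a different order. The work per element then depends on how many points fall into the nine surrounding buckets rather than on m.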

I made this into a library called close-numerical-matches. Sample usage:

>>> import numpy as np
>>> from close_numerical_matches import find_matches
>>> arr0 = np.array([[25, 24], [50, 50], [25, 26]])
>>> arr1 = np.array([[25, 23], [25, 25], [50.6, 50.6], [60, 60]])
>>> find_matches(arr0, arr1, tol=1.0001)
array([[0, 0], [0, 1], [1, 2], [2, 1]])
>>> find_matches(arr0, arr1, tol=0.9999)
array([[1, 2]])
>>> find_matches(arr0, arr1, tol=0.60001)
array([], dtype=int64)
>>> find_matches(arr0, arr1, tol=0.60001, dist='max')
array([[1, 2]])
>>> manhatten_dist = lambda arr: np.sum(np.abs(arr), axis=1)
>>> matches = find_matches(arr0, arr1, tol=0.11, dist=manhatten_dist)
>>> matches
array([[0, 1], [0, 1], [2, 1]])
>>> indices0, indices1 = matches.T
>>> arr0[indices0]
array([[25, 24], [25, 24], [25, 26]])

Some profiling:

from timeit import default_timer as timer
import numpy as np
from close_numerical_matches import naive_find_matches, find_matches

arr0 = np.random.rand(320_000, 2)
arr1 = np.random.rand(44_000, 2)

start = timer()
naive_find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start)  # 255.335 s

start = timer()
find_matches(arr0, arr1, tol=0.001)
end = timer()
print(end - start)  # 5.821 s

Answered By - shmulvad

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
