Issue
I have a large matrix (like 100.000 x 100.000). The good thing it contains only zeros and ones and mostly zeros (it is already saved as boolean matrix to save some RAM). Now I need to multiply each column of the matrix with all of the other columns. The reason is that I need need to check whether there is as least one row where both columns have a non-zero element (therefore multiplying and summing the resulting vector to check whether it is zero or not). As an example assume we have a matrix
1.column | 2.column | 3.column |
---|---|---|
1 | 0 | 0 |
1 | 1 | 0 |
0 | 0 | 1 |
Then I need to compare all columns and check whether there is as least one row where both columns are one. So comparing the first and the second column would return a True since they are both one in the second row. However comparing the first and third column and the second the third column would result in a Falsesince there are no rows with a row where both are one. Obviously this can be done using a for loop and iterating over all columns. However not in a very satisfying speed. I already tried numba like this:
@njit(parallel=True)
def create_dist_arr(arr: np.array):
n = arr.shape[1]
dist_arr = np.zeros(shape=(n, n)) #, dtype=bool)
for i in prange(arr.shape[1]):
for j in prange(i, arr.shape[1]):
dist_greater_zero = calc_dist_graeter_than_zeros(arr[:, i], arr[:, j])
dist_arr[i][j] = dist_greater_zero
dist_arr[i][j] = dist_greater_zero
return skill_dist_arr
@njit
def calc_dist_graeter_than_zeros(ith_col, jth_col):
return np.sum(np.multiply(ith_col, jth_col)) != 0
zero_arr = np.zeros(shape=(2000, 6000), dtype=bool)
bool_dist_matrix = create_dist_arr(zero_arr)
But despite having 120gb Ram and 32 cores, that gets very slow around 10.000 x 10.000 matrices. Even worse is it when trying scipy.spatial.distance.pdist like this:
from scipy.spatial.distance import pdist
zero_arr = np.zeros(shape=(500, 500), dtype=bool)
bool_dist_matrix = pdist(zero_arr, lambda u, v: np.sum(np.multiply(u, v)) != 0)
Is there maybe some nice and fast workaround using sparse matrices or something else not taking like forever?
Thank you in advance :)
Solution
Idk how memory efficient this solution is nor if it will be faster than the other solutions but it is vectorized.
Your idea was to multiply the columns and add the result. which reminded me of matrix multiplication, except its against its own columns. so..
Lets say your matrix is M
. If you take the transverse M.T
and matrix multiply against itself, M.T @ M
it will be the same as multiplying each column against the other columns and taking the sum.
import numpy as np
# A random matrix
M = np.array([[1,0,0,0,0,1],
[1,1,0,0,0,1],
[0,1,1,1,0,0]])
bool_dist_matrix = (M.T @ M).astype('bool')
np.fill_diagonal(bool_dist_matrix, 0)
"""
[[0 1 0 0 0 1]
[1 0 1 1 0 1]
[0 1 0 1 0 0]
[0 1 1 0 0 0]
[0 0 0 0 0 0]
[1 1 0 0 0 0]]
"""
Answered By - Ta946
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.