Issue
I am trying to create a huge boolean
matrix which is randomly filled with True
and False
with a given probability p
. At first I used this code:
N = 30000
p = 0.1
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
But sadly it does not seem to terminate for this big N
. So I tried to split it up into the generation of the single rows by doing this:
N = 30000
p = 0.1
mask = np.empty((N, N))
for i in range (N):
mask[i] = np.random.choice(a=[False, True], size=N, p=[p, 1-p])
if (i % 100 == 0):
print(i)
Now, there happens something strange (at least on my device): The first ~1100 rows are very fastly generated - but after it, the code becomes horribly slow. Why is this happening? What do I miss here? Are there better ways to create a big matrix which has True
entries with probability p
and False
entries with probability 1-p
?
Edit: As many of you assumed that the RAM will be a problem: As the device which will run the code has almost 500GB RAM, this won't be a problem.
Solution
Really surprised no one has mentioned this solution yet..
This line
np.random.choice(a=[False, True], size=(N, N), p=[p, 1-p])
runs NXN Bernoulli Trials. (In your case, 900M of them!) A Bernoulli trial is just a random experiment with two possible outcomes, with probabilities p and 1-p.
The sum of N Bernoulli trials, each with probability p, can be modeled by the Binomial distribution.
We can leverage this fact to randomly simulate the total count of True elements. With NumPy,
import numpy as np
N = 30000
p = 0.1
# Build a random number generator
rng = np.random.default_rng(123)
# Randomly determine the total number of True values
Ntrue = rng.binomial(n=N*N, p=p, size=1)[0] # 90016776
Now we can randomly determine the position of each True element by randomly choosing row and col indices without replacement.
# Randomly determine true position
position_ids = rng.choice(a=N*N, size=Ntrue, replace=False)
positions = np.unravel_index(position_ids, shape=(N,N))
And now we can populate a compressed sparse row (CSR) matrix.
from scipy import sparse
# Build a compressed sparse row matrix with the constructor:
# csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
result = sparse.csr_matrix((np.ones(shape=Ntrue), positions), shape=(N,N))
Notice this solution avoids storing and computing 900M boolean values.
Funny enough, I wrote about a nearly identical problem before stumbling upon this question.
Answered By - Ben
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.