Issue
I have a Python list in which each element is a 2D NumPy array of shape (20, 22). I need to convert the list to a NumPy array, but np.array(my_list) eats up the RAM, and so does np.asarray(my_list).
The list has around 7M samples. Instead of converting the list at the end, I was thinking of starting with a NumPy array and appending further 2D arrays to it as I go. I can't find a way to do that with NumPy; my aim is to start with something like this:
numpy_array = np.array([])
start_point, end_point = 0, 0
df_values = df.to_numpy()  # faster than df.values
for x in df_values:
    if condition:
        start_point += 20
        end_point += 20
        features = df_values[start_point:end_point]  # 20 rows, 22 columns
        numpy_array = np.append(numpy_array, features)  # np.append returns a new array
As you can see above, after each iteration the shape of numpy_array should grow like this:
first iteration: (1, 20, 22)
second iteration: (2, 20, 22)
third iteration: (3, 20, 22)
Nth iteration: (N, 20, 22)
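As a side note, here is a small sketch (with a dummy array, not the real data) of why the np.append approach above cannot work as written: np.append never modifies its argument in place, and without an axis argument it also flattens the result.

```python
import numpy as np

# np.append returns a brand-new copy; the original array is untouched,
# and with no axis= argument the result is flattened.
numpy_array = np.array([])
features = np.ones((20, 22))

result = np.append(numpy_array, features)
print(numpy_array.shape)  # unchanged: still (0,)
print(result.shape)       # (440,): a flattened copy, not (1, 20, 22)
```

Because a fresh copy is made on every call, doing this 7M times in a loop is quadratic in both time and memory traffic.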
Update:
Here is my full code:
import time

from tqdm import tqdm

def get_X(df_values):
    x = []  # np.array([], dtype=np.object)
    y = []  # np.array([], dtype=int32)
    counter = 0
    start_point = 20
    previous_ticker = None
    index = 0
    time_1 = time.time()
    df_length = len(df_values)
    for row in tqdm(df_values):
        if 0 <= start_point < df_length:
            ticker = df_values[start_point][0]
            flag = row[30]
            if index == 0:
                previous_ticker = ticker
            if ticker != previous_ticker:
                counter += 20
                start_point += 20
                previous_ticker = ticker
            features = df_values[counter:start_point]
            x.append(features)
            y.append(flag)
            # np.append(x, features)
            # np.append(y, flag)
            counter += 1
            start_point += 1
            index += 1
        else:
            break
    print("Time to finish the loop", time.time() - time_1)
    return x, y
x, y = get_X(df.to_numpy())
Solution
NumPy arrays are efficient precisely because they have a fixed size and type. Hence, "appending" to an array is very slow and memory-consuming, because a whole new array has to be allocated and copied on every call. If you know beforehand how many samples you have (e.g. 7,000,000), the best way is to pre-allocate:
N = 7000000

# Pre-allocate the complete array, filled with NaNs
features = np.full((N, 20, 22), np.nan, dtype=np.float64)

for whatever:
    ...
    features[counter:start_point] = ...
This should be the fastest and most memory-efficient approach when using a loop. However, this looks like a transformation of a DataFrame into a 3D array, which might be solved much, much faster with pandas' numerous transformation features.
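For instance, if the 20-row windows are contiguous and non-overlapping, the whole loop can collapse into a single reshape. This is only a sketch with a made-up DataFrame; the real code advances the window per ticker, which this ignores:

```python
import numpy as np
import pandas as pd

# Sketch, assuming the rows already form contiguous, non-overlapping
# blocks of 20 and exactly the 22 feature columns remain.
df = pd.DataFrame(np.random.rand(100, 22))  # 100 rows -> 5 samples

values = df.to_numpy()                 # shape (100, 22)
features = values.reshape(-1, 20, 22)  # shape (5, 20, 22)
print(features.shape)
```

reshape returns a view where possible, so no per-sample copying happens at all.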
If you do not know the final size in advance, err on the larger side and copy the array once down to the smaller (correct) size at the end.
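That over-allocate-then-trim pattern can be sketched like this (the capacity and sample count here are made up for illustration):

```python
import numpy as np

# Reserve an upper bound, track how many samples were actually written,
# then copy once down to the correct size at the end.
capacity = 10                        # hypothetical upper bound
features = np.full((capacity, 20, 22), np.nan)

n_written = 0
for _ in range(7):                   # suppose only 7 samples show up
    features[n_written] = np.random.rand(20, 22)
    n_written += 1

features = features[:n_written].copy()  # single final copy
print(features.shape)  # (7, 20, 22)
```

This way there is exactly one extra copy in total, instead of one per appended sample.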
Answered By - oekopez