Issue
I have a Python list in which each element is a 2D NumPy array of shape (20, 22). I need to convert the list to a NumPy array, but np.array(my_list) eats up the RAM, and so does np.asarray(my_list).
The list has around 7M samples. Instead of converting the list at the end, I was thinking of starting with a NumPy array and appending further 2D arrays to it as I go. I can't find a way to do that with NumPy; my aim is to start with something like this:
numpy_array = np.array([])
start_point, end_point = 0, 0
df_values = df.to_numpy()  # faster than df.values
for x in df_values:
    if condition:
        start_point += 20
        end_point += 20
        features = df_values[start_point:end_point]  # 20 rows, 22 columns
        numpy_array = np.append(numpy_array, features)  # np.append returns a new array
As you can see above, after each iteration the shape of numpy_array should grow like this:
first iteration: (1, 20, 22)
second iteration: (2, 20, 22)
third iteration: (3, 20, 22)
Nth iteration: (N, 20, 22)
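As a side note, here is a small sketch (with a dummy array, not the real data) of why the np.append approach above cannot work as written: np.append never modifies its argument in place, and without an axis argument it also flattens the result.

```python
import numpy as np

# np.append returns a brand-new copy; the original array is untouched,
# and with no axis= argument the result is flattened.
numpy_array = np.array([])
features = np.ones((20, 22))

result = np.append(numpy_array, features)
print(numpy_array.shape)  # unchanged: still (0,)
print(result.shape)       # (440,): a flattened copy, not (1, 20, 22)
```

Because a fresh copy is made on every call, doing this 7M times in a loop is quadratic in both time and memory traffic.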
Update:
Here is my full code:
import time

from tqdm import tqdm

def get_X(df_values):
    x = []  # np.array([], dtype=np.object)
    y = []  # np.array([], dtype=int32)
    counter = 0
    start_point = 20
    previous_ticker = None
    index = 0
    time_1 = time.time()
    df_length = len(df_values)
    for row in tqdm(df_values):
        if 0 <= start_point < df_length:
            ticker = df_values[start_point][0]
            flag = row[30]
            if index == 0:
                previous_ticker = ticker
            if ticker != previous_ticker:
                counter += 20
                start_point += 20
                previous_ticker = ticker
            features = df_values[counter:start_point]
            x.append(features)
            y.append(flag)
            # np.append(x, features)
            # np.append(y, flag)
            counter += 1
            start_point += 1
            index += 1
        else:
            break
    print("Time to finish the loop", time.time() - time_1)
    return x, y
x, y = get_X(df.to_numpy())
Solution
NumPy arrays are efficient precisely because they have a fixed size and type. Hence, "appending" to an array is very slow and memory-consuming, because a whole new array has to be allocated and copied on every call. If you know beforehand how many samples you have (e.g. 7,000,000), the best way is to pre-allocate:
N = 7000000

# Pre-allocate the complete array, filled with NaNs
features = np.full((N, 20, 22), np.nan, dtype=np.float64)

for whatever:
    ...
    features[counter:start_point] = ...
This should be the fastest and most memory-efficient approach when using a loop. However, this looks like a transformation of a DataFrame into a 3D array, which might be solved much, much faster with pandas' numerous transformation features.
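For instance, if the 20-row windows are contiguous and non-overlapping, the whole loop can collapse into a single reshape. This is only a sketch with a made-up DataFrame; the real code advances the window per ticker, which this ignores:

```python
import numpy as np
import pandas as pd

# Sketch, assuming the rows already form contiguous, non-overlapping
# blocks of 20 and exactly the 22 feature columns remain.
df = pd.DataFrame(np.random.rand(100, 22))  # 100 rows -> 5 samples

values = df.to_numpy()                 # shape (100, 22)
features = values.reshape(-1, 20, 22)  # shape (5, 20, 22)
print(features.shape)
```

reshape returns a view where possible, so no per-sample copying happens at all.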
If you do not know the final size in advance, err on the larger side and copy the array once down to the smaller (correct) size at the end.
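That over-allocate-then-trim pattern can be sketched like this (the capacity and sample count here are made up for illustration):

```python
import numpy as np

# Reserve an upper bound, track how many samples were actually written,
# then copy once down to the correct size at the end.
capacity = 10                        # hypothetical upper bound
features = np.full((capacity, 20, 22), np.nan)

n_written = 0
for _ in range(7):                   # suppose only 7 samples show up
    features[n_written] = np.random.rand(20, 22)
    n_written += 1

features = features[:n_written].copy()  # single final copy
print(features.shape)  # (7, 20, 22)
```

This way there is exactly one extra copy in total, instead of one per appended sample.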
Answered By - oekopez