Issue
Can this matrix math be done faster?
I'm using Python to render 3D points in perspective. Speed is important, because it will translate directly to frame rate at some point down the line.
I tried to use NumPy functions, but I couldn't clear out two pesky list comprehensions. 90% of my program's runtime is during the list comprehensions, which makes sense since they contain all of the math, so I want to find a faster method if possible.
- The first list comprehension happens when making pos: it does a sum and a matrix multiplication for every individual row of vert_array.
- The second one, persp, multiplies the x and y values of each row based on that specific row's z value.
Can I replace those list comprehensions with something from NumPy? I read about numpy.einsum and numpy.fromfunction, but I was struggling to understand whether they're even relevant to my problem.
Here is the function that does the main rendering calculations:
I want to make pos and persp faster:
import time
from random import randint
import numpy as np
def render_all_verts(vert_array):
    """
    :param vert_array: a 2-dimensional numpy array of float32 values of
        size n x 3, formatted as follows, where each row represents one
        vertex's coordinates in world-space coordinates:
        [[vert_x_1, vert_y_1, vert_z_1],
         [vert_x_2, vert_y_2, vert_z_2],
         ...
         [vert_x_n, vert_y_n, vert_z_n]]
    :return: a 2-dimensional numpy array of the same data type, size
        and format as vert_array, but in screen-space coordinates
    """
    # unit_vec is a 3x3 2D array that represents the rotation matrix
    # for the camera after some rotation (there's no rotation in this example)
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype='float32')
    # shift is a 3-element 1D array that represents the position
    # vector (x, y, z) of the camera in world-space coordinates
    shift = np.array([0, 0, 10], dtype='float32')
    # PURPOSE: This converts vert_array, with its coordinates relative
    # to the world-space axes and origin, into coordinates relative
    # to camera-space axes and origin (at the camera).
    # MATH DESCRIPTION: For every row, shift is added, then matrix
    # multiplication is performed with that sum (1x3) and unit_vec (3x3).
    pos = np.array([np.matmul(unit_vec, row + shift) for row in vert_array], dtype='float32')
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # This is a constant used to solve for the perspective
    focus = 5
    # PURPOSE: This calculation does the math to change the vertex coordinates,
    # which are relative to the camera, into a representation of how they'll
    # appear on screen in perspective. The x and y values are scaled based on
    # the z value (distance from the camera).
    # MATH DESCRIPTION: Each row's first two columns are multiplied
    # by a scalar, which is derived from that row's third column value.
    persp = np.array([np.multiply(row, np.array([focus / abs(row[2]), focus / abs(row[2]), 1])) for row in pos])
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    return persp
I wrote this code to time render_all_verts and to generate a list of random vertex coordinates to run through it repeatedly:
# TESTING RENDERING SPEED
start_time = time.time()
# The next few lines make an array similar to the 3D points I'd be rendering.
# It contains n vertices with random coordinate values from -m to m
n = 1000
m = 50
example_vertices = np.array([(randint(-m, m), randint(-m, m), randint(-m, m)) for i in range(n)])
# This empty array has the same shape as example_vertices. The results are saved here.
rendered_verts = np.empty(example_vertices.shape)
print('Example vertices:\n', example_vertices)
# This loop renders the example vertices many times
render_times = 2000
for i in range(render_times):
    rendered_verts = render_all_verts(example_vertices)
print('\n\nLast calculated render of vertices:\n', rendered_verts)
print(f'\n\nThis program took an array of {n} vertices with randomized coordinate')
print(f'values between {-m} and {m} and rendered them {render_times} times.')
print(f'--- {time.time() - start_time} seconds ---')
Finally, here's one instance of the terminal output:
C:\...\simplified_demo.py
Example vertices:
[[-45 4 -43]
[ 42 27 28]
[-33 24 -18]
...
[ -5 48 5]
[-17 -17 29]
[ -5 -46 -24]]
C:\...\simplified_demo.py:45: RuntimeWarning: divide by zero encountered in divide
persp = np.array([np.multiply(row, np.array([focus / abs(row[2]), focus / abs(row[2]), 1]))
Last calculated render of vertices:
[[ -6.81818182 0.60606061 -33. ]
[ 5.52631579 3.55263158 38. ]
[-20.625 15. -8. ]
...
[ -1.66666667 16. 15. ]
[ -2.17948718 -2.17948718 39. ]
[ -1.78571429 -16.42857143 -14. ]]
This program took an array of 1000 vertices with randomized coordinate
values between -50 and 50 and rendered them 2000 times.
--- 15.910243272781372 seconds ---
Process finished with exit code 0
P.S. NumPy seems to handle the division by zero and overflow values fine for now, so I'm not worried about the RuntimeWarning. I replaced my file paths with ...
P.P.S. Yes, I know I could just use OpenGL or any other existing rendering engine that would already handle all this math, but I'm more interested in reinventing this wheel. It's mostly an experiment for me to study Python and NumPy.
Solution
An initial speedup can be made by using vectorization:
def render_all_verts(vert_array):
    """
    :param vert_array: a 2-dimensional numpy array of float32 values of
        size n x 3, formatted as follows, where each row represents one
        vertex's coordinates in world-space coordinates:
        [[vert_x_1, vert_y_1, vert_z_1],
         [vert_x_2, vert_y_2, vert_z_2],
         ...
         [vert_x_n, vert_y_n, vert_z_n]]
    :return: a 2-dimensional numpy array of the same data type, size
        and format as vert_array, but in screen-space coordinates
    """
    # unit_vec is a 3x3 2D array that represents the rotation matrix
    # for the camera after some rotation (there's no rotation in this example)
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype='float32')
    # shift is a 3-element 1D array that represents the position
    # vector (x, y, z) of the camera in world-space coordinates
    shift = np.array([0, 0, 10], dtype='float32')
    # PURPOSE: This converts vert_array, with its coordinates relative
    # to the world-space axes and origin, into coordinates relative
    # to camera-space axes and origin (at the camera).
    # MATH DESCRIPTION: Add shift to every row at once, then apply the
    # 3x3 matrix to all rows in a single matmul call.
    pos2 = np.matmul(unit_vec, (vert_array + shift).T).T
    # The original list comprehension, kept commented out together with the
    # assertions used to check that the vectorized version matches it:
    """
    pos = np.array([np.matmul(unit_vec, row + shift) for row in vert_array], dtype='float32')
    print(vert_array.shape, unit_vec.shape)
    assert pos2.shape == pos.shape, (pos2.shape, pos.shape)
    assert np.all(pos2 == pos), np.sum(pos - pos2)
    """
    # This is a constant used to solve for the perspective
    focus = 5
    # PURPOSE: This calculation does the math to change the vertex coordinates,
    # which are relative to the camera, into a representation of how they'll
    # appear on screen in perspective. The x and y values are scaled based on
    # the z value (distance from the camera).
    # MATH DESCRIPTION: Compute all the per-row scale factors at once,
    # then multiply them in with broadcasting.
    x = focus / np.abs(pos2[:, 2])
    persp2 = np.multiply(pos2, np.dstack([x, x, np.ones(x.shape)]))
    """
    persp = np.array([np.multiply(row, np.array([focus / abs(row[2]), focus / abs(row[2]), 1])) for row in pos2])
    assert persp.shape == persp2.shape, (persp.shape, persp2.shape)
    assert np.all(persp == persp2), np.sum(persp - persp2)
    """
    return persp2
This runs in 0.18 s.
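Since the question asks about numpy.einsum: the same vectorized math can also be written with it. A minimal sketch (the function name is mine, and for a fixed 3x3 matrix einsum is usually comparable in speed to np.matmul rather than faster):

```python
import numpy as np

def render_all_verts_einsum(vert_array):
    # Same camera setup as above: identity rotation, camera at (0, 0, 10)
    unit_vec = np.eye(3, dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    # 'ij,nj->ni': apply the 3x3 matrix to every row of the shifted array
    pos = np.einsum('ij,nj->ni', unit_vec, vert_array + shift)
    # Scale x and y by focus / |z| via broadcasting; leave z untouched
    persp = pos.copy()
    persp[:, :2] *= focus / np.abs(pos[:, 2:3])
    return persp
```

The einsum subscripts make the per-row matrix multiplication explicit, which may help when studying how the loop-free version corresponds to the original list comprehension.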
Then we can remove the unneeded transpose steps for about 50% more performance:
def render_all_verts_2(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype='float32')
    shift = np.array([0, 0, 10], dtype='float32')
    # Work with 3xN columns throughout and transpose only once, at the end
    pos2 = np.matmul(unit_vec, (vert_array + shift).T)
    focus = 5
    x = focus / np.abs(pos2[2])
    persp2 = np.multiply(pos2, np.vstack([x, x, np.ones(x.shape)]))
    return persp2.T[np.newaxis,]
This version runs in 0.12 seconds on my system.
Finally, we can use numba for a further 4x speedup, down to 0.03 s:
from numba import njit

@njit
def render_all_verts_2(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    pos2 = np.dot(unit_vec, (vert_array + shift).T)
    focus = 5
    x = focus / np.abs(pos2[2])
    persp2 = np.multiply(pos2, np.vstack((x, x, np.ones(x.shape, dtype=np.float32))))
    return persp2.T.reshape((1, persp2.shape[1], persp2.shape[0]))
That is roughly 800 times faster than the 26 seconds the original was taking on my machine.
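The timings above come from a loop like the one in the question. A sketch of a reusable harness (the helper name is mine); note the warm-up call, which keeps numba's one-off JIT compilation time out of the measurement:

```python
import time
import numpy as np

def time_render(render_fn, vert_array, repeats=2000):
    """Return the total seconds taken by `repeats` calls to render_fn."""
    render_fn(vert_array)  # warm-up: triggers numba JIT compilation, if any
    start = time.perf_counter()
    for _ in range(repeats):
        render_fn(vert_array)
    return time.perf_counter() - start

# Example usage with a trivial stand-in for a render function:
# time_render(lambda a: a * 2.0, np.random.rand(1000, 3).astype(np.float32))
```

time.perf_counter() is preferable to time.time() for benchmarking because it uses the highest-resolution clock available and is guaranteed monotonic.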
Even more optimized, as suggested by @Nin17 (though I couldn't measure any impact). I also removed the 3D reshape; somewhere while testing I had asserted that the output shape should be 3D, and that is changed here.
# Best njit version with transpose - 0.03s
@njit
def render_all_verts(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    data = (vert_array + shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T
# Best normal version with transpose - 0.10s
def render_all_verts(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    data = (vert_array + shift).T
    data = np.dot(unit_vec, data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T
# Without transpose is slower, probably because of the BLAS
# implementation, memory contiguity, etc.
# Without transpose, normal - 0.14s (slowest)
def render_all_verts(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    pos2 = (vert_array + shift)
    pos2 = np.dot(pos2, unit_vec)
    pos2[:, :2] *= focus / np.abs(pos2[:, 2:3])
    return pos2
# njit without transpose (second fastest) - 0.06s
@njit
def render_all_verts_2(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    pos2 = (vert_array + shift)
    pos2 = np.dot(pos2, unit_vec)
    pos2[:, :2] *= focus / np.abs(pos2[:, 2:3])
    return pos2
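The in-place scaling trick these versions share relies on broadcasting: pos2[:, 2:3] (or data[2:3] in the transposed versions) keeps a length-1 axis, so the single per-row factor is stretched across the x and y columns. A tiny standalone illustration:

```python
import numpy as np

a = np.array([[2., 4., 10.],
              [3., 6., -5.]])
focus = 5
# a[:, 2:3] has shape (2, 1); each row's factor broadcasts over columns 0 and 1
a[:, :2] *= focus / np.abs(a[:, 2:3])
# a is now [[1., 2., 10.], [3., 6., -5.]]
```

Using a[:, 2] instead would give a shape-(2,) array, which does not broadcast against the (2, 2) slice; keeping the length-1 axis with the 2:3 slice is what makes the one-liner work.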
Finally, here is the fastest implementation in plain Python/NumPy that I could manage (it uses features numba doesn't support). It is basically on par with numba for larger vertex arrays, and faster than numba for even larger ones.
def render_all_verts(vert_array):
    unit_vec = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]], dtype=np.float32)
    shift = np.array([0, 0, 10], dtype=np.float32)
    focus = 5
    # Copy into a C-contiguous float32 3xN buffer, then work in place
    data = vert_array.T.astype(np.float32, order='C')
    data += shift[:, np.newaxis]
    np.dot(unit_vec, data, out=data)
    data[:2] *= focus / np.abs(data[2:3])
    return data.T
Answered By - arrmansa