Issue
I am trying to optimize the following Python code with Cython:
cimport numpy
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(numpy.ndarray[numpy.uint8_t, ndim=3] image):
    cdef int x, y, z
    cdef double z_val, grey
    for x in range(len(image)):
        for y in range(len(image[x])):
            grey = 0
            for z in range(len(image[x][y])):
                if z == 0:
                    z_val = image[x][y][0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x][y][1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x][y][2] * 0.72
                    grey += z_val
            image[x][y][0] = grey
            image[x][y][1] = grey
            image[x][y][2] = grey
    return image
However, when I check whether everything is as optimized as it should be, the annotated output still shows yellow lines (see picture). Is there anything else I can do to optimize this Cython code and make it run faster?
Solution
Here are some key points:
- The len() function is a Python function and has measurable overhead. Since image is an np.ndarray anyway, prefer the .shape attribute to get the number of elements in each dimension.
- Consider using image[i, j, k] instead of image[i][j][k] for element access. Only the single multi-index form is compiled to direct buffer access; chained indexing goes through Python and creates a temporary sub-array at each step.
- Prefer typed memoryviews, since the syntax is cleaner and they are generally faster. For instance, the equivalent memoryview of numpy.ndarray[T, ndim=3] is T[:, :, :], where T denotes the type of the data elements. If you know that your array's memory layout is C-contiguous, you can specify the layout by using T[:, :, ::1]. On most modern platforms, C's unsigned char is an 8-bit unsigned integer and thus equivalent to numpy.uint8_t. Therefore, your numpy.ndarray[numpy.uint8_t, ndim=3] image becomes unsigned char[:, :, ::1] image, given that image's data is C-contiguous. Alternatively, you could use uint8_t[:, :, ::1] after cimporting the C type uint8_t from libc.stdint.
- The variable grey is a double while the elements of image are np.uint8 (equivalent to unsigned char). So when doing image[i, j, k] = grey in Python, grey is cast to an unsigned char, i.e. the fractional part is cut off. In Cython, you have to do the cast manually.
- After you know your code works as expected, you can further accelerate it with directives for the Cython compiler, e.g. deactivating bounds checks and negative-index handling (wraparound). Note that these are decorators that need to be imported.
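The truncation described in the points above can be checked in plain Python. The following is only an illustrative sketch: the helper name weighted_gray is made up for this example, and int() stands in for the <unsigned char> cast, which is a fair stand-in only while the value stays within 0..255.

```python
# Illustrative sketch in pure Python, not the Cython code itself.
# weighted_gray is a hypothetical helper; the weights are the ones from the question.

def weighted_gray(c0: int, c1: int, c2: int) -> float:
    """Weighted sum of the three channel values, as a float."""
    return c0 * 0.21 + c1 * 0.07 + c2 * 0.72


grey = weighted_gray(255, 0, 0)   # roughly 53.55
truncated = int(grey)             # drops the fractional part, like the manual cast
rounded = int(grey + 0.5)         # rounds to nearest instead (non-negative values only)

print(truncated, rounded)
```

If rounding is preferred over truncation, adding 0.5 before the cast is one option; which behaviour is right depends on the output you want to match.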
And your code snippet becomes:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y, z
    cdef double z_val, grey
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            grey = 0
            for z in range(image.shape[2]):
                if z == 0:
                    z_val = image[x, y, 0] * 0.21
                    grey += z_val
                elif z == 1:
                    z_val = image[x, y, 1] * 0.07
                    grey += z_val
                elif z == 2:
                    z_val = image[x, y, 2] * 0.72
                    grey += z_val
            image[x, y, :] = <unsigned char> grey
    return image
Looking closely, you'll see that there's no need for the most inner loop:
from cython cimport boundscheck, wraparound

@boundscheck(False)
@wraparound(False)
def cython_color2gray(unsigned char[:, :, ::1] image):
    cdef int x, y
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            image[x, y, :] = <unsigned char>(image[x, y, 0] * 0.21 + image[x, y, 1] * 0.07 + image[x, y, 2] * 0.72)
    return image
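To confirm that the simplified version still produces the same numbers, a slow pure-Python reference can be run on a tiny image. This is a sketch under assumptions: color2gray_reference is a hypothetical helper, it works on nested lists rather than NumPy arrays, and it uses int() to mirror the truncating cast.

```python
# Hypothetical reference implementation for cross-checking small inputs.
# Operates on nested lists shaped [rows][cols][3]; not meant to be fast.

def color2gray_reference(image):
    """Return a new nested list with all three channels set to the weighted grey value."""
    result = []
    for row in image:
        new_row = []
        for c0, c1, c2 in row:
            # Same weights and channel order as the Cython code above.
            grey = int(c0 * 0.21 + c1 * 0.07 + c2 * 0.72)
            new_row.append([grey, grey, grey])
        result.append(new_row)
    return result


tiny = [[[255, 0, 0], [10, 20, 30]],
        [[0, 0, 0], [0, 0, 255]]]
expected = color2gray_reference(tiny)
print(expected)
```

Assuming NumPy is available, the compiled function could then be compared against it with something like numpy.asarray(cython_color2gray(arr)).tolist() == expected, where arr is a C-contiguous uint8 array built from the same values. A function that takes a typed memoryview and returns it hands back a memoryview object, which numpy.asarray converts back to an array.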
Going one step further, you can try to accelerate Cython's generated C code by enabling your C compiler's auto-vectorization (in the sense of SIMD). For gcc/clang you can use the flags -O3 and -march=native. For MSVC it's /O2 and /arch:AVX2 (assuming your machine supports AVX2). If you're working inside a Jupyter notebook, you can pass C compiler flags via the -c=YOURFLAG argument for the Cython magic, i.e.
%%cython -a -f -c=-O3 -c=-march=native
# your cython code here..
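Outside a notebook, the same flags are usually passed through the build script. A minimal setup.py sketch for gcc/clang, assuming the code lives in a file named color2gray.pyx (the file and module names are placeholders, not from the original answer):

```python
# setup.py: minimal sketch; "color2gray" and "color2gray.pyx" are placeholder names.
from setuptools import Extension, setup
from Cython.Build import cythonize

extension = Extension(
    name="color2gray",
    sources=["color2gray.pyx"],
    # gcc/clang flags from the text; for MSVC use ["/O2", "/arch:AVX2"] instead.
    extra_compile_args=["-O3", "-march=native"],
)

# annotate=True also writes the HTML report with the yellow-line highlighting.
setup(ext_modules=cythonize(extension, annotate=True))
```

This would typically be built with python setup.py build_ext --inplace. Keep in mind that a binary compiled with -march=native may not run on machines with a different CPU.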
Answered By - joni