Issue
I have an OpenCV image, as usual in BGR color space, and I need to convert it to CMYK. I searched online but found basically only (slight variations of) the following approach:
def bgr2cmyk(cv2_bgr_image):
bgrdash = cv2_bgr_image.astype(float) / 255.0
# Calculate K as (1 - whatever is biggest out of Rdash, Gdash, Bdash)
K = 1 - numpy.max(bgrdash, axis=2)
with numpy.errstate(divide="ignore", invalid="ignore"):
# Calculate C
C = (1 - bgrdash[..., 2] - K) / (1 - K)
C = 255 * C
C = C.astype(numpy.uint8)
# Calculate M
M = (1 - bgrdash[..., 1] - K) / (1 - K)
M = 255 * M
M = M.astype(numpy.uint8)
# Calculate Y
Y = (1 - bgrdash[..., 0] - K) / (1 - K)
Y = 255 * Y
Y = Y.astype(numpy.uint8)
return (C, M, Y, K)
This works fine, however, it feels quite slow - for an 800 x 600 px image it takes about 30 ms on my i7 CPU. Typical operations with cv2
like thresholding and alike take only a few ms for the same image, so since this is all numpy
I was expecting this CMYK conversion to be faster.
However, I haven't found anything that makes this significantly fater. There is a conversion to CMYK via PIL.Image
, but the resulting channels do not look as they do with the algorithm listed above.
Any other ideas?
Solution
There are several things you should do:
- shake the math
- use integer math where possible
- optimize beyond what numpy can do
Shaking the math
Given
RGB' = RGB / 255
K = 1 - max(RGB')
C = (1-K - R') / (1-K)
M = (1-K - G') / (1-K)
Y = (1-K - B') / (1-K)
You see what you can factor out.
RGB' = RGB / 255
J = max(RGB')
K = 1 - J
C = (J - R') / J
M = (J - G') / J
Y = (J - B') / J
Integer math
Don't normalize to [0,1]
for these calculations. The max()
can be done on integers. The differences can too. K
can be calculated entirely with integer math.
J = max(RGB)
K = 255 - J
C = 255 * (J - R) / J
M = 255 * (J - G) / J
Y = 255 * (J - B) / J
Numba
import numba
Numba will optimize that code beyond simply using numpy library routines. It will also do the parallelization as indicated. Choosing the numpy
error model and allowing fastmath
will cause division by zero to not throw an exception or warning, but also make the math a little faster.
Both variants significantly outperform a plain python/numpy solution. Much of that is due to better use of CPU registers caches, rather than intermediate arrays, as is usual with numpy.
First variant: ~1.9 ms
@numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v4(bgr_img):
bgr_img = np.ascontiguousarray(bgr_img)
(height, width) = bgr_img.shape[:2]
CMYK = np.empty((height, width, 4), dtype=np.uint8)
for i in numba.prange(height):
for j in range(width):
B,G,R = bgr_img[i,j]
J = max(R, G, B)
K = np.uint8(255 - J)
C = np.uint8(255 * (J - R) / J)
M = np.uint8(255 * (J - G) / J)
Y = np.uint8(255 * (J - B) / J)
CMYK[i,j] = (C,M,Y,K)
return CMYK
Thanks to Cris Luengo for pointing out further refactoring potential (pulling out 255/J
), leading to a second variant. It takes ~1.6 ms
@numba.njit(parallel=True, error_model="numpy", fastmath=True)
def bgr2cmyk_v5(bgr_img):
bgr_img = np.ascontiguousarray(bgr_img)
(height, width) = bgr_img.shape[:2]
CMYK = np.empty((height, width, 4), dtype=np.uint8)
for i in numba.prange(height):
for j in range(width):
B,G,R = bgr_img[i,j]
J = np.uint8(max(R, G, B))
Jinv = np.uint16((255*256) // J) # fixed point math
K = np.uint8(255 - J)
C = np.uint8(((J - R) * Jinv) >> 8)
M = np.uint8(((J - G) * Jinv) >> 8)
Y = np.uint8(((J - B) * Jinv) >> 8)
CMYK[i,j] = (C,M,Y,K)
return CMYK
This fixed point math causes floor rounding. For round-to-nearest, the expression must be ((J - R) * Jinv + 128) >> 8
. That would cost a bit more time then (~1.8 ms).
What else?
I think that numba/LLVM didn't apply SIMD here. Some investigation revealed that the Loop Vectorizer doesn't like any of the instances it was asked to consider.
An OpenCL kernel might be even faster. OpenCL can run on CPUs.
Answered By - Christoph Rackwitz
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.