Issue
I'm trying to optimize the performance of my Python program, and I think I have identified this piece of code as the bottleneck:
for i in range(len(green_list)):
    rgb_list = []
    for j in range(len(green_list[i])):
        rgb_list.append('%02x%02x%02x' % (red_list[i][j], green_list[i][j], blue_list[i][j]))
    write_file(str(i), rgb_list)
Where red_list, green_list and blue_list are numpy arrays with values like this:
red_list = [[1, 2, 3, 4, 5], [51, 52, 53, 54, 55]]
green_list = [[6, 7, 8, 9, 10], [56, 57, 58, 59, 60]]
blue_list = [[11, 12, 13, 14, 15], [61, 62, 63, 64, 65]]
At the end of each execution of the inner loop, rgb_list contains the hex values:
rgb_list = ['01060b', '02070c', '03080d', '04090e', '050a0f']
Now, it is not clear to me how to exploit the potential of NumPy arrays, but I think there is a way to optimize those two nested loops. Any suggestions?
Solution
I assume the essential traits of your code can be summarized in the following generator:
import numpy as np

def as_str_OP(r_arr, g_arr, b_arr):
    n, m = r_arr.shape
    for i in range(n):
        rgb = []
        for j in range(m):
            rgb.append('%02x%02x%02x' % (r_arr[i, j], g_arr[i, j], b_arr[i, j]))
        yield rgb
which can be consumed with a for loop, for example to write to disk:
for x in as_str_OP(r_arr, g_arr, b_arr):
    write_to_disk(x)
The generator itself can be written either with the core computation vectorized in NumPy or in a Numba-friendly way. The key is to replace the relatively slow string interpolation with a custom-made int-to-hex computation.
This results in substantial speed-up, especially as the size of the input grows (and particularly the second dimension).
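The idea can be checked on a single value first. This is a minimal sketch (the helper name manual_hex2 is hypothetical, for illustration only) confirming that a divmod-based digit lookup agrees with the '%02x' formatting it replaces:

```python
def manual_hex2(v):
    # split a byte into its two hex digits and look them up directly,
    # avoiding string interpolation
    hi, lo = divmod(v, 16)
    digits = '0123456789abcdef'
    return digits[hi] + digits[lo]

r, g, b = 1, 6, 11
assert manual_hex2(r) + manual_hex2(g) + manual_hex2(b) == '%02x%02x%02x' % (r, g, b)
print(manual_hex2(r) + manual_hex2(g) + manual_hex2(b))  # 01060b
```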
Below is the NumPy-vectorized version. Note that the scalar hex_to_ascii() from the Numba version below would not broadcast over arrays (its if branches on a whole array), so a np.where()-based equivalent is used here:

def hex_to_ascii_np(x):
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + np.where(x < 10, ascii_num_offset, ascii_alp_offset)

def as_str_np(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        r0, r1 = divmod(r_arr[i, :], 16)
        g0, g1 = divmod(g_arr[i, :], 16)
        b0, b1 = divmod(b_arr[i, :], 16)
        rgb[:, 0] = hex_to_ascii_np(r0)
        rgb[:, 1] = hex_to_ascii_np(r1)
        rgb[:, 2] = hex_to_ascii_np(g0)
        rgb[:, 3] = hex_to_ascii_np(g1)
        rgb[:, 4] = hex_to_ascii_np(b0)
        rgb[:, 5] = hex_to_ascii_np(b1)
        yield rgb.view(f'<U{2 * l}').reshape(m).tolist()
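As a sanity check of the vectorized digit computation, here is a small standalone sketch (the np.where-based helper is an assumption mirroring the scalar ASCII-offset logic, not part of the original answer):

```python
import numpy as np

def hex_to_ascii_vec(x):
    # digits '0'-'9' start at ASCII 48; 'a'-'f' start at 97, i.e. value + 87
    return x + np.where(x < 10, 48, 87)

vals = np.array([1, 10, 255])
hi, lo = divmod(vals, 16)
chars = np.empty((3, 2), dtype=np.uint32)
chars[:, 0] = hex_to_ascii_vec(hi)
chars[:, 1] = hex_to_ascii_vec(lo)
print(chars.view('<U2').reshape(3).tolist())  # ['01', '0a', 'ff']
```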
and the Numba-accelerated version:
import numba as nb

@nb.njit
def hex_to_ascii(x):
    ascii_num_offset = 48  # ord(b'0') == 48
    ascii_alp_offset = 87  # ord(b'a') == 97, (num of non-alpha digits) == 10
    return x + (ascii_num_offset if x < 10 else ascii_alp_offset)

@nb.njit
def _to_hex_2d(x):
    a, b = divmod(x, 16)
    return hex_to_ascii(a), hex_to_ascii(b)

@nb.njit
def _as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for i in range(n):
        rgb = np.empty((m, 2 * l), dtype=np.uint32)
        for j in range(m):
            rgb[j, 0:2] = _to_hex_2d(r_arr[i, j])
            rgb[j, 2:4] = _to_hex_2d(g_arr[i, j])
            rgb[j, 4:6] = _to_hex_2d(b_arr[i, j])
        yield rgb

def as_str_nb(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m).tolist()
This essentially involves manually writing each number, correctly converted to hexadecimal ASCII characters, into a properly typed array, which can then be viewed as strings to give the desired output.
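The view trick itself can be seen in isolation: uint32 matches the 4-byte code points of NumPy's UTF-32 string dtype ('<U...'), so a row of six character codes reinterprets cleanly as one 6-character string (toy values, for illustration):

```python
import numpy as np

# ASCII codes for '0', '1', '6', '0', '1', '1'
codes = np.array([[48, 49, 54, 48, 49, 49]], dtype=np.uint32)
print(codes.view('<U6').reshape(1).tolist())  # ['016011']
```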
Note that the final numpy.ndarray.tolist() could be avoided if whatever consumes the generator is capable of dealing with the NumPy array itself, saving some potentially large and definitely appreciable time, e.g.:
def as_str_nba(r_arr, g_arr, b_arr):
    l = 3
    n, m = r_arr.shape
    for x in _as_str_nb(r_arr, g_arr, b_arr):
        yield x.view(f'<U{2 * l}').reshape(m)
Overcoming IO-bound bottleneck
However, if you are IO-bound, you should modify your code to write in blocks, e.g. using the grouper recipe from the itertools recipes:
from itertools import zip_longest

def grouper(iterable, n, *, incomplete='fill', fillvalue=None):
    "Collect data into non-overlapping fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, fillvalue='x') --> ABC DEF Gxx
    # grouper('ABCDEFG', 3, incomplete='strict') --> ABC DEF ValueError
    # grouper('ABCDEFG', 3, incomplete='ignore') --> ABC DEF
    args = [iter(iterable)] * n
    if incomplete == 'fill':
        return zip_longest(*args, fillvalue=fillvalue)
    if incomplete == 'strict':
        return zip(*args, strict=True)
    if incomplete == 'ignore':
        return zip(*args)
    else:
        raise ValueError('Expected fill, strict, or ignore')
to be used like:
group_size = 3
for x in grouper(as_str_OP(r_arr, g_arr, b_arr), group_size):
    write_many_to_disk(x)
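For instance, with the default 'fill' behavior (restated minimally here so the snippet is self-contained), five rows grouped in threes produce one full block and one None-padded block:

```python
from itertools import zip_longest

def grouper(iterable, n, *, fillvalue=None):
    # 'fill' branch of the full recipe above
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

rows = ['01060b', '02070c', '03080d', '04090e', '050a0f']
for block in grouper(rows, 3):
    print(block)
# ('01060b', '02070c', '03080d')
# ('04090e', '050a0f', None)
```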
Testing out the output
Some dummy input can be produced easily (r_arr is essentially red_list, etc.):
def gen_color(n, m):
    return np.random.randint(0, 2 ** 8, (n, m))

N, M = 10, 3
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
and tested by consuming the generator to produce a list:
res_OP = list(as_str_OP(r_arr, g_arr, b_arr))
res_np = list(as_str_np(r_arr, g_arr, b_arr))
res_nb = list(as_str_nb(r_arr, g_arr, b_arr))
res_nba = list(as_str_nba(r_arr, g_arr, b_arr))
print(np.array(res_OP))
# [['1f6984' '916d98' 'f9d779']
# ['65f895' 'ded23e' '332fdc']
# ['b9e059' 'ce8676' 'cb75e9']
# ['bca0fc' '3289a9' 'cc3d3a']
# ['6bb0be' '07134a' 'c3cf05']
# ['152d5c' 'bac081' 'c59a08']
# ['97efcc' '4c31c0' '957693']
# ['15247e' 'af8f0a' 'ffb89a']
# ['161333' '8f41ce' '187b01']
# ['d811ae' '730b17' 'd2e269']]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [x.tolist() for x in res_nba])
# True
eventually passing through some grouping:
k = 3
res_OP = list(grouper(as_str_OP(r_arr, g_arr, b_arr), k))
res_np = list(grouper(as_str_np(r_arr, g_arr, b_arr), k))
res_nb = list(grouper(as_str_nb(r_arr, g_arr, b_arr), k))
res_nba = list(grouper(as_str_nba(r_arr, g_arr, b_arr), k))
print(np.array(res_OP))
# [[list(['1f6984', '916d98', 'f9d779'])
# list(['65f895', 'ded23e', '332fdc'])
# list(['b9e059', 'ce8676', 'cb75e9'])]
# [list(['bca0fc', '3289a9', 'cc3d3a'])
# list(['6bb0be', '07134a', 'c3cf05'])
# list(['152d5c', 'bac081', 'c59a08'])]
# [list(['97efcc', '4c31c0', '957693'])
# list(['15247e', 'af8f0a', 'ffb89a'])
# list(['161333', '8f41ce', '187b01'])]
# [list(['d811ae', '730b17', 'd2e269']) None None]]
print(res_OP == res_np)
# True
print(res_OP == res_nb)
# True
print(res_OP == [tuple(y.tolist() if y is not None else y for y in x) for x in res_nba])
# True
Benchmarks
To give you an idea of the numbers we could be talking about, let us use %timeit on much larger inputs:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 list(as_str_OP(r_arr, g_arr, b_arr))
# 1 loop, best of 1: 1.1 s per loop
%timeit -n 4 -r 4 list(as_str_np(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 279 ms per loop
%timeit -n 4 -r 4 list(as_str_nb(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 96.5 ms per loop
%timeit -n 4 -r 4 list(as_str_nba(r_arr, g_arr, b_arr))
# 4 loops, best of 4: 10.4 ms per loop
To simulate disk writing we could use the following consumer:
import time
import math

def consumer(gen, timeout_sec=0.001, weight=1):
    result = []
    for x in gen:
        result.append(x)
        time.sleep(timeout_sec * weight)
    return result
where disk writing is simulated with a time.sleep() call whose timeout depends on the logarithm of the object size:
N, M = 1000, 1000
r_arr = gen_color(N, M)
g_arr = gen_color(N, M)
b_arr = gen_color(N, M)
%timeit -n 1 -r 1 consumer(as_str_OP(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 2.37 s per loop
%timeit -n 1 -r 1 consumer(as_str_np(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.48 s per loop
%timeit -n 1 -r 1 consumer(as_str_nb(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.27 s per loop
%timeit -n 1 -r 1 consumer(as_str_nba(r_arr, g_arr, b_arr), weight=math.log2(2))
# 1 loop, best of 1: 1.13 s per loop
k = 100
%timeit -n 1 -r 1 consumer(grouper(as_str_OP(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 1.17 s per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_np(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 368 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nb(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 173 ms per loop
%timeit -n 1 -r 1 consumer(grouper(as_str_nba(r_arr, g_arr, b_arr), k), weight=math.log2(1 + k))
# 1 loop, best of 1: 87.4 ms per loop
Ignoring the disk-writing simulation, the NumPy-vectorized approach is ~4x faster at the test input sizes, while the Numba-accelerated approach gets ~10x to ~100x faster, depending on whether the potentially unnecessary conversion to list with numpy.ndarray.tolist() is present or not.
When it comes to the simulated disk writing, the faster versions are all more or less equivalent, and noticeably less effective without grouping, resulting in a ~2x speed-up.
With grouping alone the speed-up reaches ~2x, but when grouping is combined with the faster approaches, the speed-ups range between the ~3x of the NumPy-vectorized version and the ~7x or ~13x of the Numba-accelerated approaches (with or without numpy.ndarray.tolist()).
Again, this is with the given input, and under the test conditions. The actual mileage may vary.
Answered By - norok2