Issue
I saw a video about the speed of loops in Python, where it was explained that sum(range(N)) is much faster than manually looping through the range and adding the variables together, since the former runs in C thanks to built-in functions, while in the latter the summation is done in (slow) Python. I was curious what happens when NumPy is added to the mix. As I expected, np.sum(np.arange(N)) is the fastest, but sum(np.arange(N)) and np.sum(range(N)) are even slower than the naive for loop.
Why is this?
Here's the script I used to test, with some comments about the supposed cause of the slowdown where known (taken mostly from the video), and the results I got on my machine (Python 3.10.0, NumPy 1.21.2):
Updated script:
import numpy as np
from timeit import timeit

N = 10_000_000
repetition = 10

def sum0(N=N):
    s = 0
    i = 0
    while i < N:  # condition is checked in python
        s += i
        i += 1  # both additions are done in python
    return s

def sum1(N=N):
    s = 0
    for i in range(N):  # increment in C
        s += i  # addition in python
    return s

def sum2(N=N):
    return sum(range(N))  # everything in C

def sum3(N=N):
    return sum(list(range(N)))

def sum4(N=N):
    return np.sum(range(N))  # very slow np.array conversion

def sum5(N=N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N), dtype=int))

def sum5v2_(N=N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N), dtype=np.int_))

def sum6(N=N):
    # possibly slow conversion to Py_long from np.int
    return sum(np.arange(N))

def sum7(N=N):
    # list returns a list of np.int-s
    return sum(list(np.arange(N)))

def sum7v2(N=N):
    # tolist conversion to python int seems faster than the implicit conversion
    # in sum(list()) (tolist returns a list of python int-s)
    return sum(np.arange(N).tolist())

def sum8(N=N):
    return np.sum(np.arange(N))  # everything in numpy (fortran libblas?)

def sum9(N=N):
    return np.arange(N).sum()  # remove dispatch overhead

def array_basic(N=N):
    return np.array(range(N))

def array_dtype(N=N):
    return np.array(range(N), dtype=np.int_)

def array_iter(N=N):
    # np.sum's source code mentions using fromiter to convert from generators
    return np.fromiter(range(N), dtype=np.int_)
print(f"while loop: {timeit(sum0, number = repetition)}")
print(f"for loop: {timeit(sum1, number = repetition)}")
print(f"sum_range: {timeit(sum2, number = repetition)}")
print(f"sum_rangelist: {timeit(sum3, number = repetition)}")
print(f"npsum_range: {timeit(sum4, number = repetition)}")
print(f"npsum_iterrange: {timeit(sum5, number = repetition)}")
print(f"npsum_iterrangev2: {timeit(sum5, number = repetition)}")
print(f"sum_arange: {timeit(sum6, number = repetition)}")
print(f"sum_list_arange: {timeit(sum7, number = repetition)}")
print(f"sum_arange_tolist: {timeit(sum7v2, number = repetition)}")
print(f"npsum_arange: {timeit(sum8, number = repetition)}")
print(f"nparangenpsum: {timeit(sum9, number = repetition)}")
print(f"array_basic: {timeit(array_basic, number = repetition)}")
print(f"array_dtype: {timeit(array_dtype, number = repetition)}")
print(f"array_iter: {timeit(array_iter, number = repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum8(N/1000), number = 100000*repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum9(N/1000), number = 100000*repetition)}")
# Example output:
#
# while loop: 11.493371912998555
# for loop: 7.385945574002108
# sum_range: 2.4605720699983067
# sum_rangelist: 4.509678105998319
# npsum_range: 11.85120212900074
# npsum_iterrange: 4.464334709002287
# npsum_iterrangev2: 4.498494338993623
# sum_arange: 9.537815956995473
# sum_list_arange: 13.290120724996086
# sum_arange_tolist: 5.231948580003518
# npsum_arange: 0.241889145996538
# nparangenpsum: 0.21876695199898677
# array_basic: 11.736577274998126
# array_dtype: 8.71628468400013
# array_iter: 4.303306431000237
# npsumarangeREP: 21.240833958996518
# nparangenpsumREP: 16.690092379001726
Solution
np.sum(range(N)) is slow mostly because the current Numpy implementation does not use enough information about the exact type/content of the values provided by the generator range(N). The heart of the general problem is inherently the dynamic typing of Python and its big integers, although Numpy could optimize this specific case.
First of all, range(N) returns a dynamically-typed Python object, a (special kind of) lazy sequence that behaves much like a generator. The values it provides are also dynamically typed: in practice, each one is a pure-Python integer.
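As a quick illustration, a range is a lazy sequence whose elements are plain Python ints, so Numpy has to pick a dtype on its own when converting it:

import numpy as np

r = range(10)
print(type(r))              # <class 'range'> -- a lazy sequence, nothing is materialized
print(type(next(iter(r))))  # <class 'int'> -- a pure-Python, arbitrary-precision integer
print(np.asarray(r).dtype)  # Numpy must detect a suitable dtype during the conversion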
The thing is, Numpy is written in the statically-typed language C, so it cannot efficiently work on dynamically-typed pure-Python objects. Numpy's strategy is to convert such objects into C types when it can. One big problem in this case is that the integers provided by the generator can theoretically be huge: Numpy does not know whether the values can overflow an np.int32 or even an np.int64 type. Thus, Numpy first detects the right type to use and then computes the result using this type.
This translation process can be quite expensive and may appear unnecessary here, since all the values provided by range(10_000_000) easily fit in an np.int32. However, range(5_000_000_000) returns the same object type, with pure-Python integers overflowing np.int32, and Numpy needs to detect this case automatically so as not to return wrong results. Moreover, even when the input type is correctly identified (np.int32 on my machine), that does not mean the output result will be correct, because overflows can appear during the computation of the sum. This is sadly the case on my machine.
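To see the problem concretely, here is a minimal sketch that forces a 32-bit accumulator, reproducing the kind of overflow described above (on platforms where the default integer is 64-bit, Numpy would otherwise pick np.int64 and the result would be correct):

import numpy as np

exact = sum(range(10_000_000))  # pure-Python ints never overflow
# forcing a 32-bit accumulator mimics a platform where np.int_ is 32-bit
wrapped = np.arange(10_000_000).sum(dtype=np.int32)
print(exact)         # 49999995000000
print(int(wrapped))  # a wrapped-around, wrong value: the sum exceeds 2**31 - 1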
Numpy developers decided to deprecate such a use and to state in the documentation that np.fromiter should be used instead. np.fromiter has a required dtype parameter that lets the user define the right type to use.
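For example, a minimal sketch of the documented approach, choosing a dtype wide enough to hold both the values and the final sum:

import numpy as np

# the explicit dtype skips the automatic type-detection pass
arr = np.fromiter(range(10_000_000), dtype=np.int64)
print(arr.sum())  # 49999995000000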
One way to check this behaviour in practice is to simply create a temporary list:

tmp = list(range(10_000_000))
# Numpy implicitly converts the list into a Numpy array, but
# it still has to detect the input type automatically
np.sum(tmp)
A faster implementation is the following:
tmp = list(range(10_000_000))
# The array is explicitly converted using a well-defined type, so
# there is no need to perform automatic type detection
# (note that the result is still wrong since the sum does not fit in an np.int32)
tmp2 = np.array(tmp, dtype=np.int32)
result = np.sum(tmp2)
The first case takes 476 ms on my machine while the second takes 289 ms. Note that np.sum itself takes only 4 ms. Thus, a large part of the time is spent in the conversion of pure-Python integer objects to internal int32 values (more specifically, in the management of pure-Python integers). list(range(10_000_000)) is expensive too, as it takes 205 ms. This is again due to the overhead of pure-Python integers (i.e. allocations, deallocations, reference counting, incrementing variable-sized integers, memory indirections, and the branching caused by dynamic typing) as well as the overhead of the generator.
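These measurements can be reproduced with timeit; a rough sketch (absolute times will of course differ from machine to machine):

import numpy as np
from timeit import timeit

tmp = list(range(10_000_000))
tmp2 = np.array(tmp, dtype=np.int32)

print(timeit(lambda: np.sum(tmp), number=1))    # type detection + conversion + sum
print(timeit(lambda: np.array(tmp, dtype=np.int32), number=1))  # explicit conversion only
print(timeit(lambda: np.sum(tmp2), number=1))   # the sum alone, on an existing array
print(timeit(lambda: list(range(10_000_000)), number=1))  # building the list itself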
sum(np.arange(N)) is slow because sum is a generic Python function iterating over a Numpy-defined object: the CPython interpreter needs to call Numpy functions to perform the basic additions. Moreover, Numpy-defined integer objects are still Python objects, so they are subject to reference counting, allocation, deallocation, etc. Not to mention that Numpy and CPython add many checks in functions that ultimately just add two native numbers together. A Numpy-aware just-in-time compiler such as Numba can solve this issue. Indeed, Numba takes 23 ms on my machine to compute the sum of np.arange(10_000_000) (with the code still written in Python), while the CPython interpreter takes 556 ms.
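A minimal sketch of the Numba approach (assuming the numba package is installed); the decorated function is compiled to native code, so the loop runs without creating a Python integer object per element:

import numpy as np
import numba as nb

@nb.njit
def numba_sum(arr):
    s = 0
    for x in arr:  # runs as a native loop after JIT compilation
        s += x
    return s

print(numba_sum(np.arange(10_000_000)))  # 49999995000000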
Answered By - Jérôme Richard