Issue
I have a script that uses a lot of numpy and numpy.linalg functions, and after some research it turned out that they supposedly use multithreading automatically. Despite that, my htop display always shows just one thread being used to run my script.
I am new to multithreading and I don't quite know how to set it up correctly.
I am mostly making use of numpy.linalg.svd.
Here is the output of numpy.show_config():
openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
MRE
import numpy as np
import tensorly as ty

# Random 4D tensor, unfolded along mode 0 into a 32 x 32768 matrix
tensor = np.random.rand(32, 32, 32, 32)
unfolding = ty.unfold(tensor, 0)

# 32 x 32 Gram matrix, then its SVD
unfolding = unfolding @ unfolding.T
U, S, _ = np.linalg.svd(unfolding)
Update
As suggested in the accepted answer, rebuilding NumPy against MKL solved the issue.
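For reference, one way to check which BLAS library NumPy actually ended up linked against, and how many threads it will use, is the threadpoolctl package; a quick sketch, assuming it is installed:

from threadpoolctl import threadpool_info

# Lists every BLAS/OpenMP library loaded in the process with its thread count
for info in threadpool_info():
    print(info["internal_api"], info["num_threads"], info["filepath"])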
Solution
The main issue is that the matrices are too small for threads to be really worth it on all platforms. Indeed, OpenBLAS uses OpenMP to create threads depending on the size of the matrix. Threads are generally created once, but their creation can take from dozens of microseconds to dozens of milliseconds depending on the machine (typically hundreds of microseconds on a regular PC). The more cores the machine has, the more threads must be created, and so the bigger the overhead. When the OpenMP thread pool is reused, there are still overheads to pay, mainly due to the distribution of the work and synchronization between threads, though these are generally significantly smaller (typically an order of magnitude smaller).
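One way to measure whether threading pays off at a given size is to cap the BLAS thread pool and compare timings. A minimal sketch, assuming the threadpoolctl package is installed; the shapes mirror the MRE and the repetition count is arbitrary:

import time
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(32, 32 * 32 * 32)  # same shape as the unfolding in the MRE
g = a @ a.T                           # 32 x 32 Gram matrix

for label, limit in (("1 thread", 1), ("default ", None)):
    with threadpool_limits(limits=limit, user_api="blas"):
        t0 = time.perf_counter()
        for _ in range(100):
            np.linalg.svd(g)
        print(label, (time.perf_counter() - t0) / 100, "s per SVD")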
That being said, OpenBLAS clearly makes sub-optimal choices when the output matrix is tiny compared to the input ones (which is your case). Indeed, OpenBLAS can hardly know the parallel overhead before running the target kernel, so it has to make a choice: set a threshold, typically based on the sizes of the input matrices, to decide whether the kernel is executed sequentially or with multiple threads. This is critical so that very small kernels are still fast while huge ones remain competitive with other BLAS implementations. The problem is that this threshold is not perfectly chosen. It looks like OpenBLAS only looks at the size of the output matrix, which is clearly sub-optimal for "thin" matrices like in your code (e.g. 50x1000000 @ 1000000x50). An empirical analysis shows that the threshold is arbitrarily set to 100x100 in your case: beyond this threshold, OpenBLAS uses multiple threads, but not below it. The thing is, threads are already useful for significantly smaller matrices in your case on most platforms (e.g. for 64x64x64x64 tensors).
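This can be checked empirically by timing thin products whose output size straddles 100x100 while watching the thread count in htop; a rough sketch (the sizes here are illustrative, not from the answer):

import time
import numpy as np

k = 100_000
for n in (50, 90, 110, 200):  # output is n x n; threshold observed near 100x100
    a = np.random.rand(n, k)
    t0 = time.perf_counter()
    _ = a @ a.T
    print(f"{n}x{k} @ {k}x{n}: {time.perf_counter() - t0:.3f} s")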
This threshold is tuned by compile-time definitions like GEMM_MULTITHREAD_THRESHOLD, which is used in gemm.c (and gemv.c). Note that in the code, the k dimension matters, but this is not what benchmarks show on my machine (possibly due to an older version of OpenBLAS being used). You can rebuild OpenBLAS with a smaller threshold (e.g. 1 instead of the default 4).
An alternative solution is to use another BLAS implementation, like BLIS or the Intel MKL, which should use different thresholds (possibly better ones). A last solution is to write a specialized implementation that efficiently computes the matrices of your code (possibly using Numba or Cython), but BLAS implementations are heavily optimized, so it is often hard to actually write faster code (unless you are very familiar with low-level optimizations, compilers, and modern processor architectures).
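As an illustration of that specialized-kernel idea, here is a minimal Numba sketch that computes the small Gram matrix from the MRE with an explicitly parallel loop; the function name gram and the loop structure are my own (not from the answer), it assumes float64 input, and a tuned BLAS will usually still win on large inputs:

import numba as nb
import numpy as np

@nb.njit(parallel=True, fastmath=True)
def gram(a):
    # Computes a @ a.T for an (n, k) float64 array, parallelizing over output rows,
    # so it threads regardless of any BLAS size threshold.
    n, k = a.shape
    out = np.empty((n, n), dtype=np.float64)
    for i in nb.prange(n):
        for j in range(n):
            s = 0.0
            for l in range(k):
                s += a[i, l] * a[j, l]
            out[i, j] = s
    return out

a = np.random.rand(32, 32 * 32 * 32)
g = gram(a)  # equivalent to a @ a.T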
Answered By - Jérôme Richard