How To Avoid Huge Overhead Of Single-threaded Numpy's Transpose?
Solution 1:
Computing transpositions efficiently is hard. This primitive is not compute-bound but memory-bound. This is especially true for big matrices stored in RAM (rather than fitting in CPU caches).
Numpy uses a view-based approach, which is great when only a slice of the array is needed, but quite slow when the computation is done eagerly (e.g. when a copy is performed). In that case, the way Numpy is implemented results in a lot of cache misses that strongly degrade performance.
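For instance, here is a minimal sketch of the difference (the array name and size are just for illustration):

    import numpy as np

    a = np.random.rand(10000, 10000)

    view = a.T                        # lazy: only strides/metadata change, no data is moved
    copy = np.ascontiguousarray(a.T)  # eager: strided reads of `a` cause many cache misses

The first line is essentially free; the second one is where the slow, memory-bound copy actually happens.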
I found this function virtually always runs single-threaded, no matter how large the transposed matrix/array is.
This clearly depends on the Numpy implementation used. AFAIK, some optimized packages, like Intel's, are more efficient and take advantage of multithreading more often.
I think a parallelized transpose function should be a solution. The problem is how to implement it.
Yes and no. Using more threads is not necessarily faster, at least not much faster, and not on all platforms. The algorithm used matters far more than the number of threads.
On modern desktop x86-64 processors, each core can be bound by its cache hierarchy, but this limit is quite high, so two cores are often enough to nearly saturate the RAM throughput. For example, on my (4-core) machine, a sequential copy can reach 20.4 GiB/s (Numpy manages to reach this limit), while my (practical) memory throughput is close to 35 GiB/s. Copying A takes 72 ms, while the naive Numpy transposition of A takes 700 ms. Even using all my cores, a parallel implementation would not be faster than 175 ms, while the optimal theoretical time is 42 ms. Actually, a naive parallel implementation would be much slower than 175 ms because of cache misses and the saturation of my L3 cache.
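Here is a rough sketch of how these two timings can be measured (the 10000x10000 size is an assumption; the exact numbers are machine-dependent):

    import time
    import numpy as np

    A = np.random.rand(10000, 10000)   # ~0.75 GiB of float64, much bigger than the caches
    out = np.empty_like(A)

    t = time.perf_counter()
    out[:] = A                          # contiguous copy: close to the memory bandwidth limit
    print('copy:', time.perf_counter() - t)

    t = time.perf_counter()
    out[:] = A.T                        # naive transposed copy: strided accesses, far slower
    print('transpose:', time.perf_counter() - t)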
Naive transposition implementations do not read/write data contiguously. The memory access pattern is strided and most cache lines are wasted. Because of this, data are read/written from memory multiple times for huge matrices (typically 8 times on current x86-64 platforms using double precision). A tile-based transposition algorithm is an efficient way to prevent this issue, and it also strongly reduces cache misses. Ideally, hierarchical tiles or a cache-oblivious Z-tiling pattern should be used, although this is harder to implement.
Here is a Numba-based implementation based on the previous information:
    import numba as nb

    # Parallel out-of-place transposition of a C-contiguous float64 matrix.
    # The matrix is traversed block by block, and each block tile by tile,
    # so that reads and writes stay cache-friendly.
    @nb.njit('void(float64[:,::1], float64[:,::1])', parallel=True)
    def transpose(mat, out):
        blockSize, tileSize = 256, 32  # To be tuned
        n, m = mat.shape
        assert blockSize % tileSize == 0
        for tmp in nb.prange((m + blockSize - 1) // blockSize):
            i = tmp * blockSize
            for j in range(0, n, blockSize):
                tiMin, tiMax = i, min(i + blockSize, m)
                tjMin, tjMax = j, min(j + blockSize, n)
                for ti in range(tiMin, tiMax, tileSize):
                    for tj in range(tjMin, tjMax, tileSize):
                        out[ti:ti+tileSize, tj:tj+tileSize] = mat[tj:tj+tileSize, ti:ti+tileSize].T
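It can be used like this (the shapes are illustrative; the output matrix must be preallocated and C-contiguous):

    import numpy as np

    mat = np.random.rand(10000, 10000)
    out = np.empty((mat.shape[1], mat.shape[0]))
    transpose(mat, out)   # out now holds mat.T, computed in parallel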
If you want faster code, you can use highly optimized native libraries implementing the transposition, like the Intel MKL. Such libraries often take advantage of low-level processor-specific instructions (SIMD instructions and non-temporal stores) to use the caches/RAM more efficiently.
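As a sketch of what calling such a library can look like from Python, assuming the MKL runtime is installed and loadable under the name libmkl_rt.so (the library name and the use of its mkl_domatcopy extension are assumptions, not something Numpy does for you):

    import ctypes
    import numpy as np

    mkl = ctypes.CDLL('libmkl_rt.so')  # assumption: the MKL runtime is on the loader path

    def mkl_transpose(mat, out):
        # mkl_domatcopy performs an out-of-place (scaled) copy/transposition of a row-major matrix.
        n, m = mat.shape
        mkl.mkl_domatcopy(ctypes.c_char(b'R'),   # row-major ordering
                          ctypes.c_char(b'T'),   # transpose
                          ctypes.c_size_t(n), ctypes.c_size_t(m),
                          ctypes.c_double(1.0),  # no scaling
                          mat.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(m),
                          out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), ctypes.c_size_t(n))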
Here are the timing results (assuming the output matrix is already filled in memory):
Naive Numpy: 700 ms
Above code without Numba (sequential): 287 ms
Numba version (sequential): 157 ms
Numba version (parallel): 104 ms
Very-optimized C code (parallel): 68 ms
Theoretical optimum: 42 ms