Problem description
Lately I've been trying to get into GPU programming in Python using the Numba library. I have been reading the tutorial on their website, and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the fast matrix multiplication example a bit (which is of the form A*B=C). When testing, I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.
I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work).
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.
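For small cases like this, one quick way to sanity-check the kernel on the host is to compare against NumPy (a minimal sketch that reuses the variable names from the script above, so x_h, y_h, and z_d are assumed to exist):

expected = x_h @ y_h                       # CPU reference result
z_check = z_d.copy_to_host()
print(np.allclose(z_check, expected))      # should print True for a correct kernel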
Recommended answer
Arguably, there are at least two issues in that posted code:
1. This can't possibly be a correct range check:
if x >= C.shape[0] and y >= C.shape[1]:
In order for us to decide that a particular thread in the grid should not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.
2. It is illegal to use cuda.syncthreads() in conditional code if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes that are not whole-number divisible by the threadblock size.
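To illustrate the hazard with a hypothetical kernel (a sketch for exposition, not the posted code):

from numba import cuda

@cuda.jit
def bad_kernel(C):
    x, y = cuda.grid(2)
    if x >= C.shape[0] or y >= C.shape[1]:
        return                 # out-of-bounds threads leave the block here...
    cuda.syncthreads()         # ...so the remaining threads wait at a barrier that the
                               # exited threads never reach: undefined behavior
    C[x, y] = 0.0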
Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.
The following code has those items fixed. It seems to work correctly for your given test case:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty + i * TPB) < A.shape[1]:
            sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx + i * TPB) < B.shape[0]:
            sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp
#%%
x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
[[ 6. 6. 6. 6.]
[22. 22. 22. 22.]
[38. 38. 38. 38.]
[54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$
(Note that the use of and here in the bounds tests is correct. Testing whether a set of indices is in-bounds is different in a boolean sense from testing whether it is out-of-bounds. In the in-bounds test, we require both indices to be in-bounds; in the out-of-bounds test, either index being out-of-bounds is disqualifying.)
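In other words, the two guards are De Morgan complements of each other; a small sketch of the two equivalent forms:

# In-bounds guard (used around the loads and the final store): both indices must hold
if x < C.shape[0] and y < C.shape[1]:
    C[x, y] = tmp

# Equivalent out-of-bounds guard (e.g. for an early exit): either failure disqualifies
if x >= C.shape[0] or y >= C.shape[1]:
    return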
I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)
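For instance, CUDA caps a block at 1024 threads, so a square 2-D block requires TPB <= 32; one way to make that constraint explicit in the host code (a sketch) is:

TPB = 32
assert 1 <= TPB <= 32, "a (TPB, TPB) block would exceed the 1024-threads-per-block limit"
threadsperblock = (TPB, TPB)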
As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:
$ cat t49.py
from numba import cuda, float32
import numpy as np
import math
@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx + i * TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty + i * TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()

    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
#%%
x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])
x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)
#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253. 253. 253. 253. 253. 253. 253.]
[ 782. 782. 782. 782. 782. 782. 782.]
[1311. 1311. 1311. 1311. 1311. 1311. 1311.]
[1840. 1840. 1840. 1840. 1840. 1840. 1840.]
[2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$
I've also reordered the sense of x and y (and the usage of tx and ty) to fix a performance issue in the above code: with the reordering, consecutive threads (consecutive tx) access adjacent elements in global memory, so the loads and stores are coalesced. The same performance issue was present in the original posted doc code as well.
Again, I make no claim that this is defect-free. Furthermore, I'm sure "more optimal" code could be arrived at. However, optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.
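For example, a minimal cupy sketch (assuming cupy is installed for your CUDA version) that dispatches the multiply to a tuned library routine:

import numpy as np
import cupy as cp

A = cp.asarray(np.arange(115, dtype=np.float32).reshape(5, 23))
B = cp.asarray(np.ones((23, 7), dtype=np.float32))
C = A @ B                      # executed on the GPU via the library's GEMM
print(cp.asnumpy(C))           # copy the result back to the host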
As discussed here, OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.
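Concretely, the difference is just in how the accumulator is initialized inside the kernel:

tmp = 0.              # Python float literal: Numba types it as float64, so the accumulation runs in double precision
tmp = float32(0.)     # keeps the accumulator in 32-bit, matching the float32 shared arrays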