Question
I try to use CUDA Python with Numba. The code calculates the sum of a 1D array as follows, but I don't know how to get a single value as the result rather than three values.

python3.5 with numba + CUDA8.0
import os, sys, time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'
bpg = (1, 1)
tpb = (1, 3)

@cuda.jit
def calcu_sum(D, T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh
    print('su:', su)

    T[0, 0] = su
    print('T:', T[0, 0])

D = np.array([0.42487645, 0.41607881, 0.42027071, 0.43751907, 0.43512794,
              0.43656972, 0.43940639, 0.43864551, 0.43447691, 0.43120232],
             dtype=np.float32)
T = np.empty([1, 1])
print('D: ', D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD, dT)
The output is:
D: [ 0.42487645 0.41607881 0.42027071 0.43751907 0.43512794 0.43656972
0.43940639 0.43864551 0.43447691 0.43120232]
su: 1.733004
su: 1.289852
su: 1.291317
T: 1.733004
T: 1.289852
T: 1.291317
Why can't I get the output "4.31417383" rather than "1.733004 1.289852 1.291317"? (1.733004 + 1.289852 + 1.291317 = 4.314173.)

I'm new to Numba and have read the Numba documentation, but I don't know how to do this. Can someone give advice?
Answer
The reason you don't get the sum you expect is that you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran or Python as your language) is that you write kernel code which is executed by every thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share their individual partial sums and combine them into a final sum.

There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm which uses newer features of CUDA (warp shuffle instructions and atomic transactions) here.
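The structure of that tree-shaped reduction can be illustrated on the CPU with plain NumPy. This is only a sketch of the algorithm's halving pattern, not CUDA code: at each step, every "thread" adds in a partner element one stride away, and the stride halves until one value remains.

```python
import numpy as np

def tree_reduce_sum(data):
    # Pad the input up to a power-of-two length so partners always line up.
    size = 1 << int(np.ceil(np.log2(len(data))))
    buf = np.zeros(size, dtype=data.dtype)
    buf[:len(data)] = data

    stride = size // 2
    while stride > 0:
        # Element i absorbs element i + stride; on a GPU, these
        # additions would run in parallel within a thread block.
        buf[:stride] += buf[stride:2 * stride]
        stride //= 2
    return buf[0]

a = np.arange(1, 11, dtype=np.float64)
print(tree_reduce_sum(a))  # 55.0
```

Note that a block of N threads finishes in log2(N) steps rather than N, which is where the parallel speedup comes from.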
After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:
tpb = (1, 3)

@cuda.jit
def calcu_sum(D, T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    sbuf = cuda.shared.array(tpb, float32)

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh
    print('su:', su)

    sbuf[0, ty] = su
    cuda.syncthreads()

    if ty == 0:
        T[0, 0] = 0
        for i in range(0, bh):
            T[0, 0] += sbuf[0, i]
        print('T:', T[0, 0])
will probably do what you want, although it is still a long way from an optimal parallel shared-memory reduction, as you will see when you read the material linked above.