
Why can't I get the right sum of a 1D array with numba (CUDA Python)?

Problem description

I try to use cuda python with numba. The code is to calculate the sum of a 1D array as follows, but I don't know how to get one value result rather than three values.

python3.5 with numba + CUDA8.0

import os,sys,time
import pandas as pd
import numpy as np
from numba import cuda, float32

os.environ['NUMBAPRO_NVVM'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\bin\nvvm64_31_0.dll'
os.environ['NUMBAPRO_LIBDEVICE'] = r'D:\NVIDIA GPU Computing Toolkit\CUDA\v8.0\nvvm\libdevice'

bpg = (1,1)  # blocks per grid
tpb = (1,3)  # threads per block

@cuda.jit
def calcu_sum(D,T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    L = len(D)
    su = 0
    while index_i<L:
        su +=D[index_i]
        index_i +=bh
    print('su:',su)
    T[0,0]=su
    print('T:',T[0,0])


D = np.array([ 0.42487645,0.41607881,0.42027071,0.43751907,0.43512794,0.43656972,
               0.43940639,0.43864551,0.43447691,0.43120232], dtype=np.float32)
T = np.empty([1,1])
print('D: ',D)

stream = cuda.stream()
with stream.auto_synchronize():
    dD = cuda.to_device(D, stream)
    dT = cuda.to_device(T, stream)
    calcu_sum[bpg, tpb, stream](dD,dT)

The output is:

D:  [ 0.42487645  0.41607881  0.42027071  0.43751907  0.43512794  0.43656972
  0.43940639  0.43864551  0.43447691  0.43120232]
su:  1.733004
su:  1.289852
su:  1.291317
T: 1.733004
T: 1.289852
T: 1.291317

Why can't I get the output "4.31417383" rather than "1.733004 1.289852 1.291317" ? 1.733004+1.289852+1.291317=4.314173.

I'm new to numba, read the numba documentation, but don't know how to do it. Can someone give advice ?

Answer

The reason you don't get the sum you expect is because you haven't written code to produce that sum.

The basic CUDA programming model (whether you use CUDA C, Fortran or Python as your language) is that you write kernel code which is executed by each thread. You have written code for each thread to read and sum part of the input array. You have not written any code for those threads to share and sum their individual partial sums into a final sum.
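This also explains why exactly three values are printed: each of the three threads runs a strided loop over the array and keeps its own partial sum. The strided loop can be mimicked on the CPU (a plain-Python sketch; `partial_sums` is a hypothetical helper, not part of the original code):

```python
def partial_sums(data, num_threads):
    # Each "thread" ty starts at index ty and strides by num_threads,
    # mirroring index_i += cuda.blockDim.y in the kernel.
    return [sum(data[ty::num_threads]) for ty in range(num_threads)]

# Three threads each produce their own partial sum; nothing combines them.
print(partial_sums(list(range(10)), 3))  # [18, 12, 15] -> total 45
```

In the question's output, `1.733004`, `1.289852` and `1.291317` are precisely these per-thread partial sums; each thread then overwrites `T[0,0]` with its own value instead of combining them.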

There is an extremely well described algorithm for doing this -- it is called a parallel reduction. You can find an introduction to the algorithm in a PDF which ships in the examples of every version of the CUDA toolkit, or download a presentation about it here. You can also read a more modern version of the algorithm which uses newer features of CUDA (warp shuffle instructions and atomic transactions) here.
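The core idea of the shared-memory tree reduction can be sketched in plain Python (a CPU illustration only, not CUDA code; the helper name is made up):

```python
def tree_reduce_sum(values):
    # Tree reduction: at each step, element i accumulates element
    # i + stride, halving the number of active elements per round
    # until the total ends up in element 0.
    buf = list(values)
    stride = 1
    while stride < len(buf):
        for i in range(0, len(buf) - stride, 2 * stride):
            buf[i] += buf[i + stride]
        stride *= 2
    return buf[0]

# Combining the three partial sums from the question's output:
print(tree_reduce_sum([1.733004, 1.289852, 1.291317]))  # ≈ 4.314173
```

On a GPU each `i` in the inner loop is a separate thread working on a shared-memory buffer, with a `cuda.syncthreads()` barrier between rounds.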

After you have studied the reduction algorithm, you will need to adapt the standard CUDA C kernel code into the Numba Python kernel dialect. At the bare minimum, something like this:

tpb = (1,3)

@cuda.jit
def calcu_sum(D, T):
    ty = cuda.threadIdx.y
    bh = cuda.blockDim.y
    index_i = ty
    sbuf = cuda.shared.array(tpb, float32)

    L = len(D)
    su = 0
    while index_i < L:
        su += D[index_i]
        index_i += bh

    print('su:', su)

    sbuf[0,ty] = su
    cuda.syncthreads()

    if ty == 0:
        T[0,0] = 0
        for i in range(0, bh):
            T[0,0] += sbuf[0,i]
        print('T:', T[0,0])

will probably do what you want, although it is still a long way from an optimal parallel shared memory reduction, as you will see when you read the material I provided links to.
