久久久久久久av_日韩在线中文_看一级毛片视频_日本精品二区_成人深夜福利视频_武道仙尊动漫在线观看

如何使用 numba 在 GPU 上泛化快速矩陣乘法

How to generalize fast matrix multiplication on GPU using numba(如何使用 numba 在 GPU 上泛化快速矩陣乘法)
本文介紹了如何使用 numba 在 GPU 上泛化快速矩陣乘法的處理方法,對(duì)大家解決問(wèn)題具有一定的參考價(jià)值,需要的朋友們下面隨著小編來(lái)一起學(xué)習(xí)吧!

問(wèn)題描述

最近我一直在嘗試使用 Numba 庫(kù)在 Python 中進(jìn)行 GPU 編程.我一直在使用那里的教程在他們的網(wǎng)站上閱讀它,目前我堅(jiān)持使用他們的示例,可以在這里找到:https://numba.pydata.org/numba-doc/latest/cuda/examples.html.我試圖概括一下快速矩陣乘法的示例(其形式為 A*B=C).在測(cè)試時(shí),我注意到維度不能完全被每塊線程數(shù) (TPB) 整除的矩陣不會(huì)產(chǎn)生正確的答案.

Lately I've been trying to get into programming for GPUs in Python using the Numba library. I have been reading up on it on their website using the tutorial there and currently I'm stuck on their example, which can be found here: https://numba.pydata.org/numba-doc/latest/cuda/examples.html. I'm attempting to generalize the example for the fast matrix multiplication a bit (which is of the form A*B=C). When testing I noticed that matrices with dimensions that are not perfectly divisible by the number of threads per block (TPB) do not yield a correct answer.

我從 https://的示例中復(fù)制了以下代碼numba.pydata.org/numba-doc/latest/cuda/examples.html 并用 4 x 4 矩陣創(chuàng)建了一個(gè)非常小的測(cè)試用例.如果我選擇 TPB=2 一切都很好,但是當(dāng)我設(shè)置 TPB=3 時(shí)就會(huì)出錯(cuò).我知道代碼超出了矩陣的范圍,但我無(wú)法阻止這種情況的發(fā)生(我嘗試了一些關(guān)于 ty + i * TPBtx + i * 的 if 語(yǔ)句TPB,但這些都不起作用.

I copied the code below from the example at https://numba.pydata.org/numba-doc/latest/cuda/examples.html and created a very small test case with 4 by 4 matrices. If I choose TPB=2 everything is fine, but when I set TPB=3 it goes wrong. I understand that the code goes out of the bounds of the matrices, but I am unable to prevent this from happening (I tried some if statements on ty + i * TPB and tx + i * TPB, but these did not work.

from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)

我想編寫(xiě)一些不依賴(lài)于矩陣 A、B 和 C 的代碼,這些矩陣的維度可以被 TPB 完全整除,因?yàn)檫@些有時(shí)是我無(wú)法控制的.我知道 GPU 只有在處理非常大的矩陣時(shí)才能更快地進(jìn)行矩陣乘法,但我想使用一些小示例來(lái)檢查答案是否正確,然后再將其應(yīng)用于實(shí)際數(shù)據(jù).

I would like to write some code that is not dependent on the matrices A, B, and C having dimensions that are perfectly divisible by the TPB, as these are sometimes out of my control. I understand that GPUs are only faster with matrix multiplication for very large matrices, but I wanted to use small examples to be able to check whether the answer is correct before applying it to actual data.

推薦答案

可以說(shuō)在 那張貼代碼:

  1. 這不可能是一個(gè)正確的范圍檢查:

  1. This can't possibly be a correct range check:

if x >= C.shape[0] and y >= C.shape[1]:

為了讓我們確定網(wǎng)格中的特定線程不執(zhí)行任何加載活動(dòng),我們要求 要么 x 超出范圍或 y 超出范圍.and 應(yīng)該是 or.

In order for us to decide that a particular thread in the grid not do any loading activity, we require either that x is out of range or that y is out of range. The and should have been an or.

非法 如果塊中的所有線程都不能參與該語(yǔ)句,則在條件代碼中使用 cuda.syncthreads().上面第 1 項(xiàng)中的前一個(gè) return 語(yǔ)句(即使從 and 更正為 or)幾乎可以保證這種非法行為對(duì)于問(wèn)題大小不是完整的——數(shù)可被線程塊大小整除.

It is illegal to use cuda.syncthreads() in conditional code, if all the threads in the block cannot participate in that statement. The previous return statement in item 1 above (even if corrected from and to or) pretty much guarantees this illegal behavior for problem sizes not whole-number-divisible by the threadblock size.

因此,要解決這些問(wèn)題,我們不能只對(duì)越界線程使用簡(jiǎn)單的 return 語(yǔ)句.相反,在加載點(diǎn),我們必須只允許線程從全局加載到共享內(nèi)存,如果計(jì)算的全局加載索引(對(duì)于 AB)在-界限(根據(jù)定義,共享索引在界限內(nèi)).此外,在編寫(xiě)結(jié)果時(shí),我們必須只編寫(xiě) C 范圍內(nèi)的計(jì)算結(jié)果.

Therefore, to fix these issues, we cannot use just a simple return statement for an out-of-bounds thread. Instead, at the point of load, we must only allow threads to load from global to shared memory, if the computed global load indices (for A or B) are in-bounds (the shared indices are in-bounds, by definition). Furthermore, when writing a result, we must only write computed results that are in-bounds for C.

以下代碼已修復(fù)這些項(xiàng)目.它似乎適用于您給定的測(cè)試用例:

The following code has those items fixed. It seems to work correctly for your given test case:

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = 0
        sB[tx, ty] = 0
        if x < A.shape[0] and (ty+i*TPB) < A.shape[1]:
          sA[tx, ty] = A[x, ty + i * TPB]
        if y < B.shape[1] and (tx+i*TPB) < B.shape[0]:
          sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp



#%%

x_h = np.arange(16).reshape([4,4])
y_h = np.ones([4,4])
z_h = np.zeros([4,4])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

TPB = 3
threadsperblock = (TPB, TPB)
blockspergrid_x = math.ceil(z_h.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(z_h.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
[[ 6.  6.  6.  6.]
 [22. 22. 22. 22.]
 [38. 38. 38. 38.]
 [54. 54. 54. 54.]]
========= ERROR SUMMARY: 0 errors
$

(注意這里在邊界測(cè)試中使用 是正確的.測(cè)試一組索引是否在邊界內(nèi)與測(cè)試一組索引是否在外在布爾意義上是不同的-of-bounds.在 in-bounds 測(cè)試中,我們要求兩者都是 in-bounds.在 out-of-bounds 測(cè)試中,任何一個(gè) index out-of-bounds 都是不合格的).

(Note the use of and here in the bounds tests is correct. Testing whether a set of indices are in-bound is different in a boolean sense compared to testing whether a set of indices is out-of-bounds. In the in-bounds test, we require both to be in-bounds. In the out-of-bounds test, either index out-of-bounds is disqualifying).

我并不是說(shuō)上述代碼沒(méi)有缺陷或適用于任何特定目的.提供它是為了演示對(duì)我發(fā)現(xiàn)的問(wèn)題的可能修復(fù).正如您所發(fā)現(xiàn)的,讓共享內(nèi)存平鋪矩陣乘法在每個(gè)可以想象的配置中工作并非易事,而且我沒(méi)有在此處顯示的范圍之外對(duì)其進(jìn)行測(cè)試.(例如,如果你決定讓 TPB 大于 32,你會(huì)遇到其他問(wèn)題.另外,原始發(fā)布的代碼僅用于方陣乘法,這在一般的非平方情況下不起作用.)

I'm not suggesting the above code is defect-free or suitable for any particular purpose. It is offered to demonstrate possible fixes for the issues I identified. Getting a shared-memory tiled matrix multiply to work in every imaginable configuration is non-trivial, as you have discovered, and I've not tested it beyond what is shown here. (For example, if you decided to make TPB larger than 32, you would run into other problems. Also, the original posted code is advertised only for square matrix multiplication, and this will not work in the general non-square case.)

如上所述,發(fā)布的代碼和上面帶有修復(fù)"的代碼是將無(wú)法正確處理一般的非方形情況.我相信一些簡(jiǎn)單的修改將使我們能夠處理非正方形的情況.簡(jiǎn)而言之,我們必須將網(wǎng)格的大小設(shè)置得足夠大以處理兩個(gè)輸入矩陣的維度,同時(shí)仍然只為輸出矩陣的邊界值寫(xiě)入結(jié)果.這是一個(gè)經(jīng)過(guò)簡(jiǎn)單測(cè)試的示例:

As noted above, the posted code and the above code with "fixes" will not correctly handle the general non-square case. I believe some straightforward modifications will allow us to handle the non-square case. In a nutshell, we must size the grid large enough to handle the dimensions of both input matrices, while still only writing results for the in-bounds values of the output matrix. Here is a lightly tested example:

$ cat t49.py
from numba import cuda, float32
import numpy as np
import math

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = float32(0.)
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
          sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
          sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp



#%%

x_h = np.arange(115).reshape([5,23])
y_h = np.ones([23,7])
z_h = np.zeros([5,7])

x_d = cuda.to_device(x_h)
y_d = cuda.to_device(y_h)
z_d = cuda.to_device(z_h)

#TPB must be an integer between 1 and 32
TPB = 32
threadsperblock = (TPB, TPB)
grid_y_max = max(x_h.shape[0],y_h.shape[0])
grid_x_max = max(x_h.shape[1],y_h.shape[1])
blockspergrid_x = math.ceil(grid_x_max / threadsperblock[0])
blockspergrid_y = math.ceil(grid_y_max / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

fast_matmul[blockspergrid, threadsperblock](x_d, y_d, z_d)
z_h = z_d.copy_to_host()
print(z_h)
print(x_h@y_h)
$ cuda-memcheck python t49.py
========= CUDA-MEMCHECK
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
[[ 253.  253.  253.  253.  253.  253.  253.]
 [ 782.  782.  782.  782.  782.  782.  782.]
 [1311. 1311. 1311. 1311. 1311. 1311. 1311.]
 [1840. 1840. 1840. 1840. 1840. 1840. 1840.]
 [2369. 2369. 2369. 2369. 2369. 2369. 2369.]]
========= ERROR SUMMARY: 0 errors
$

我還重新排序了 xy 的含義(以及 txty 的用法)修復(fù)上述代碼中的性能問(wèn)題.原始發(fā)布的文檔代碼中也存在相同的性能問(wèn)題.

I've also reordered the sense of x and y (and usage of tx and ty) to fix a performance issue in the above code. The same performance issue was present in the original posted doc code as well.

同樣,沒(méi)有聲稱(chēng)沒(méi)有缺陷.此外,我確信更理想".代碼可以到達(dá).然而,優(yōu)化矩陣乘法是一個(gè)應(yīng)該很快導(dǎo)致使用庫(kù)實(shí)現(xiàn)的練習(xí).在此處使用 cupy因?yàn)?GPU 方法應(yīng)該是在 GPU 上利用高質(zhì)量矩陣乘法例程的一種相當(dāng)直接的方法.

Again, no claims of defect free. Furthermore I'm sure "more optimal" code could be arrived at. However optimizing matrix multiplication is an exercise that should fairly quickly lead to using a library implementation. Using cupy here for the GPU approach should be a fairly straightforward way to tap into a high-quality matrix multiply routine on the GPU.

正如所討論的這里 OP的代碼(和,看來(lái),doc example) 也有表現(xiàn)tmp 變量的設(shè)置問(wèn)題.將其更改為適當(dāng)?shù)?32 位浮點(diǎn)變量會(huì)產(chǎn)生重要的性能差異.

As discussed here OP's code (and, it seems, the doc example) also had a performance issue around the setup of the tmp variable. Changing that to a proper 32-bit float variable makes an important performance difference.

這篇關(guān)于如何使用 numba 在 GPU 上泛化快速矩陣乘法的文章就介紹到這了,希望我們推薦的答案對(duì)大家有所幫助,也希望大家多多支持html5模板網(wǎng)!

【網(wǎng)站聲明】本站部分內(nèi)容來(lái)源于互聯(lián)網(wǎng),旨在幫助大家更快的解決問(wèn)題,如果有圖片或者內(nèi)容侵犯了您的權(quán)益,請(qǐng)聯(lián)系我們刪除處理,感謝您的支持!

相關(guān)文檔推薦

Troubles while parsing with python very large xml file(使用 python 解析非常大的 xml 文件時(shí)出現(xiàn)問(wèn)題)
Find all nodes by attribute in XML using Python 2(使用 Python 2 在 XML 中按屬性查找所有節(jié)點(diǎn))
Python - How to parse xml response and store a elements value in a variable?(Python - 如何解析 xml 響應(yīng)并將元素值存儲(chǔ)在變量中?)
How to get XML tag value in Python(如何在 Python 中獲取 XML 標(biāo)記值)
How to correctly parse utf-8 xml with ElementTree?(如何使用 ElementTree 正確解析 utf-8 xml?)
Parse XML from URL into python object(將 XML 從 URL 解析為 python 對(duì)象)
主站蜘蛛池模板: 国产精品久久一区二区三区 | 一区二区三区视频免费看 | 国产精品久久久久久久久免费 | 久久精品一区 | 精品久久久久久久久久久久久 | 亚洲精品国产电影 | 亚洲一区二区三区 | 国产精品一区二区三区在线播放 | 欧美成人免费在线视频 | 精品九九 | av中文字幕网 | 91精品久久久久久久久中文字幕 | 国产情品 | 日本精品一区二区三区视频 | 日韩在线| 精品欧美一区二区三区久久久 | 四虎精品在线 | 久草在线高清 | 欧美极品少妇xxxxⅹ免费视频 | 久久不卡 | 男女激情网 | 国产四区| 欧美成人高清视频 | 亚洲九九精品 | 亚洲 自拍 另类 欧美 丝袜 | 欧美精品影院 | 亚洲一区二区三区在线 | 日韩免费视频一区二区 | 日本三级线观看 视频 | 国产毛片久久久 | 狠狠操电影 | 久久最新 | 国产欧美精品一区二区三区 | 亚洲九九色 | 免费a在线 | 国产精品亚洲综合 | 免费一级毛片 | 狠狠狠色丁香婷婷综合久久五月 | 久久综合狠狠综合久久综合88 | 欧美国产免费 | 日本亚洲欧美 |