Problem Description
I started working with OpenMP using C++.
I have two questions:

- What is #pragma omp for schedule?
- What is the difference between dynamic and static?

Please explain with examples.
Recommended Answer
Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is better suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.
static scheduling means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing about static scheduling is that the OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then, in the second loop, the same thread can access the same memory location faster since it will reside on the same NUMA node.
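As a minimal sketch of that mapping guarantee (mirroring the dynamic-schedule test further below; the file name static.c is just an illustrative assumption), the following program should print the same iteration-to-thread pairing in both loops:

$ cat static.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i;

    #pragma omp parallel num_threads(8)
    {
        /* First loop: chunks of 1 iteration are handed out round-robin,
           so with 8 iterations and 8 threads, iteration i goes to thread i */
        #pragma omp for schedule(static,1)
        for (i = 0; i < 8; i++)
            printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

        /* Second loop: static scheduling guarantees the same mapping again */
        #pragma omp for schedule(static,1)
        for (i = 0; i < 8; i++)
            printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
    }

    return 0;
}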
Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
|             | core 0 | thread 0 |
|  socket 0   | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |
|             | core 4 | thread 4 |
|  socket 1   | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower (1.5x to 1.9x slower on Intel) than local-node access. You run something like this:
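/* Each iteration zeroes one 4 KiB page; with schedule(static,1) and
   8 threads, thread i is the first to touch page i */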
char *a = (char *)malloc(8*4096);
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 0, 4096);
4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous, but only in virtual memory. In physical memory, half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is because different parts are zeroed by different threads, those threads reside on different cores, and there is something called the first-touch NUMA policy, which means that memory pages are allocated on the NUMA node where the thread that first "touched" the page resides:
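Note that this reasoning implicitly assumes that threads stay bound to their cores across the two parallel regions. OpenMP implementations do not necessarily pin threads by default; a common way to request binding is through environment variables (OMP_PROC_BIND since OpenMP 3.1, OMP_PLACES since OpenMP 4.0; icc and gcc also understand KMP_AFFINITY and GOMP_CPU_AFFINITY respectively):

$ export OMP_PROC_BIND=true
$ export OMP_PLACES=cores
$ ./a.out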
|             | core 0 | thread 0 | a[0]     ... a[4095]
|  socket 0   | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]
|             | core 4 | thread 4 | a[16384] ... a[20479]
|  socket 1   | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]
Now let's run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 1, 4096);
Each thread will access the already mapped physical memory, and it will have the same thread-to-memory-region mapping as during the first loop. This means that threads will only access memory located in their local memory blocks, which is fast.
Now imagine that a different scheduling scheme is used for the second loop: schedule(static,2). This will "chop" the iteration space into blocks of two iterations, and there will be 4 such blocks in total. What will happen is that we will have the following thread-to-memory-location mapping (through the iteration number):
|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
|  socket 0   | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory
|             | core 4 | thread 4 | <idle>
|  socket 1   | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>
Two bad things happen here:

- threads 4 to 7 remain idle and half of the compute capability is lost;
- threads 2 and 3 access non-local memory, and they will take about twice as long to finish, during which time threads 0 and 1 will sit idle.
So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely will) produce completely different "iteration space" -> "threads" mappings, as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
    int i;

    #pragma omp parallel num_threads(8)
    {
        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

        #pragma omp for schedule(dynamic,1)
        for (i = 0; i < 8; i++)
            printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
    }

    return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(The same behaviour is observed when gcc is used instead.)
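For reference, the same test can be built with gcc like this (gcc uses -fopenmp instead of icc's -openmp):

$ gcc -fopenmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort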
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. (With 8 iterations distributed one-to-one over 8 threads, 4!·4! of the 8! possible assignments keep every iteration on its original NUMA node, and 4!·4!/8! = 1/70.) This fact is often overlooked, and hence suboptimal performance is achieved.
There is another reason to choose between static and dynamic scheduling: workload balancing. If the time each iteration takes differs vastly from the mean, a high degree of work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each task as the work progresses.
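As a rough sketch of this imbalance (the busy-wait work() function and the iteration count are illustrative assumptions, not part of the original answer), one can time such a triangular workload and swap the schedule clause between static, dynamic, and guided to compare:

#include <stdio.h>
#include <omp.h>

/* Simulated iteration whose cost grows linearly with the iteration number */
static void work(int i)
{
    volatile double x = 0.0;
    for (int j = 0; j < 100000 * i; j++)
        x += 1.0;
}

int main (void)
{
    double t0 = omp_get_wtime();

    /* Swap schedule(static) for schedule(dynamic) or schedule(guided)
       and compare the elapsed times */
    #pragma omp parallel for schedule(static) num_threads(2)
    for (int i = 0; i < 1000; i++)
        work(i);

    printf("elapsed: %.3f s\n", omp_get_wtime() - t0);
    return 0;
}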
Since precompiled code could be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling, the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune the schedule for his or her platform.
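A minimal sketch of how that looks in practice (process(), n, and the binary name app are placeholders, not from the original answer):

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    process(i);   /* actual loop body goes here */

$ OMP_SCHEDULE="static,16" ./app
$ OMP_SCHEDULE="dynamic" ./app
$ OMP_SCHEDULE="guided,4" ./app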