Problem description
I started working with OpenMP using C++. I have two questions:
- What is #pragma omp for schedule?
- What is the difference between dynamic and static?
Please explain with examples.
Accepted answer
Others have since answered most of the question, but I would like to point to some specific cases where a particular scheduling type is better suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have a great impact on the speed of the application.
The static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing about static scheduling is that the OpenMP runtime guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster, since it will reside on the same NUMA node.
Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
| | core 0 | thread 0 |
| socket 0 | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
| | core 3 | thread 3 |
| | core 4 | thread 4 |
| socket 1 | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
| | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:
char *a = (char *)malloc(8*4096);
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 0, 4096);
4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous, but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is because different parts are zeroed by different threads, those threads reside on different cores, and there is something called the first-touch NUMA policy, which means that memory pages are allocated on the NUMA node where the thread that first "touched" the page resides.
| | core 0 | thread 0 | a[0] ... a[4095]
| socket 0 | core 1 | thread 1 | a[4096] ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192] ... a[12287]
| | core 3 | thread 3 | a[12288] ... a[16383]
| | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1 | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
| | core 7 | thread 7 | a[28672] ... a[32767]
Now let's run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 1, 4096);
Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.
Now imagine that a different scheduling scheme is used for the second loop: schedule(static,2). This will "chop" the iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread-to-memory-location mapping (through the iteration number):
| | core 0 | thread 0 | a[0] ... a[8191] <- OK, same memory node
| socket 0 | core 1 | thread 1 | a[8192] ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
| | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory
| | core 4 | thread 4 | <idle>
| socket 1 | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
| | core 7 | thread 7 | <idle>
Two bad things happen here:
- threads 4 to 7 remain idle and half of the compute capability is lost;
- threads 2 and 3 access non-local memory, and they will take about twice as long to finish, during which time threads 0 and 1 will be idle.
So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely will) produce completely different "iteration space" -> "threads" mappings, as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>
int main (void)
{
int i;
#pragma omp parallel num_threads(8)
{
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
}
return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(the same behaviour is observed when gcc is used instead)
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked, and hence suboptimal performance is achieved.
There is another reason to choose between static and dynamic scheduling: workload balancing. If each iteration takes a time that differs greatly from the mean, a high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each task as the work progresses.
Since precompiled code can run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune for his or her platform.