
What's the difference between "static" and "dynamic" schedule in OpenMP?

This article explains the difference between the "static" and "dynamic" schedules in OpenMP, with worked examples.

Problem description


I started working with OpenMP using C++.

I have two questions:

  1. What is #pragma omp for schedule?
  2. What is the difference between dynamic and static?

Please explain with examples.

Recommended answer


Others have since answered most of the question but I would like to point to some specific cases where a particular scheduling type is more suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have great impact on the speed of the application.


static schedule means that iterations blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing with static scheduling is that OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread could access the same memory location faster since it will reside on the same NUMA node.


Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:

|             | core 0 | thread 0 |
| socket 0    | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
|             | core 3 | thread 3 |

|             | core 4 | thread 4 |
| socket 1    | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
|             | core 7 | thread 7 |


Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:

#include <stdlib.h>   /* malloc */
#include <string.h>   /* memset */

char *a = (char *)malloc(8*4096);

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 0, 4096);


4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads and those threads reside on different cores and there is something called first touch NUMA policy which means that memory pages are allocated on the NUMA node on which the thread that first "touched" the memory page resides.

|             | core 0 | thread 0 | a[0]     ... a[4095]
| socket 0    | core 1 | thread 1 | a[4096]  ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192]  ... a[12287]
|             | core 3 | thread 3 | a[12288] ... a[16383]

|             | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1    | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
|             | core 7 | thread 7 | a[28672] ... a[32767]


Now let's run another loop like this:

#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
   memset(&a[i*4096], 1, 4096);


Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.


Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread to memory location mapping (through the iteration number):

|             | core 0 | thread 0 | a[0]     ... a[8191]  <- OK, same memory node
| socket 0    | core 1 | thread 1 | a[8192]  ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
|             | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory

|             | core 4 | thread 4 | <idle>
| socket 1    | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
|             | core 7 | thread 7 | <idle>

Two bad things happen here:

  • threads 4 to 7 remain idle and half of the compute capability is lost;
  • threads 2 and 3 access non-local memory and will take about twice as long to finish, during which time threads 0 and 1 will stay idle.


So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.


dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings as one can easily verify:

$ cat dyn.c
#include <stdio.h>
#include <omp.h>

int main (void)
{
  int i;

  #pragma omp parallel num_threads(8)
  {
    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());

    #pragma omp for schedule(dynamic,1)
    for (i = 0; i < 8; i++)
      printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
  }

  return 0;
}

$ icc -openmp -o dyn.x dyn.c

$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4


(same behaviour is observed when gcc is used instead)


If the sample code from the static section was run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance of remote access. This fact is often overlooked, and hence suboptimal performance is achieved.


There is another reason to choose between static and dynamic scheduling - workload balancing. If each iteration takes a time vastly different from the mean time to complete, then high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first one, and hence for 2/3 of the compute time the first thread will be idle. Dynamic schedule introduces some additional overhead but in that particular case will lead to much better workload distribution. A special kind of dynamic scheduling is the guided schedule, where smaller and smaller iteration blocks are given to each task as the work progresses.


Since precompiled code could be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune for their platform.
