Problem Description
I've read some other questions on this topic, but none of them solved my problem.
I wrote the code below, and both the pthread version and the OpenMP version turned out slower than the serial version. I'm very confused.
Compiled in the following environment:
Ubuntu 12.04 64bit 3.2.0-60-generic
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Vendor ID: AuthenticAMD
CPU family: 18
Model: 1
Stepping: 0
CPU MHz: 800.000
BogoMIPS: 3593.36
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
NUMA node0 CPU(s): 0,1
Compile command:
g++ -std=c++11 ./eg001.cpp -fopenmp
#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>
#include <pthread.h>

#define NUM_THREADS 5
const int sizen = 256000000;

struct Data {
    double * pSinTable;
    long tid;
};

void * compute(void * p) {
    Data * pDt = (Data *)p;
    const int start = sizen * pDt->tid / NUM_THREADS;
    const int end = sizen * (pDt->tid + 1) / NUM_THREADS;
    for(int n = start; n < end; ++n) {
        pDt->pSinTable[n] = std::sin(2 * M_PI * n / sizen);
    }
    pthread_exit(nullptr);
}

int main()
{
    double * sinTable = new double[sizen];
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    clock_t start, finish;
    start = clock();
    int rc;
    Data dt[NUM_THREADS];
    for(int i = 0; i < NUM_THREADS; ++i) {
        dt[i].pSinTable = sinTable;
        dt[i].tid = i;
        rc = pthread_create(&threads[i], &attr, compute, &dt[i]);
    }//for
    pthread_attr_destroy(&attr);
    for(int i = 0; i < NUM_THREADS; ++i) {
        rc = pthread_join(threads[i], nullptr);
    }//for
    finish = clock();
    printf("from pthread: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);
    delete sinTable;

    sinTable = new double[sizen];
    start = clock();
    #pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from omp: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);
    delete sinTable;

    sinTable = new double[sizen];
    start = clock();
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);
    finish = clock();
    printf("from serial: %lf\n", (double)(finish - start)/CLOCKS_PER_SEC);
    delete sinTable;

    pthread_exit(nullptr);
    return 0;
}
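For reference, a typical build and run might look like the following (a.out is just g++'s default output name, and OMP_NUM_THREADS is the standard OpenMP environment variable for choosing the thread count; neither appears in the original post):

g++ -std=c++11 ./eg001.cpp -fopenmp
OMP_NUM_THREADS=2 ./a.out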
Output:
from pthread: 21.150000
from omp: 20.940000
from serial: 20.800000
I wondered whether it was a problem with my code, so I used pthreads to do the same thing.
However, I was completely wrong, and now I wonder whether it might be a problem with OpenMP/pthreads on Ubuntu.
A friend of mine who also has an AMD CPU and Ubuntu 12.04 ran into the same problem, so I have some reason to believe the problem is not limited to just me.
If anyone has had the same problem, or has any clue about it, thanks in advance.
In case the code is not good enough, I also ran a benchmark and pasted the result here:
http://pastebin.com/RquLPREc
Benchmark URL: http://www.cs.kent.edu/~farrell/mc08/lectures/progs/openmp/microBenchmarks/src/download.html
New information:
I ran the code on Windows (without the pthread version) with VS2012.
I used 1/10 of sizen because Windows would not let me allocate such a large chunk of memory. The results are:
from omp: 1.004
from serial: 1.420
from FreeNickName: 735 (this one is the improvement suggested by @FreeNickName)
Does this indicate that it could be a problem with the Ubuntu OS?
The problem is solved by using the omp_get_wtime function, which is portable across operating systems. See the answer by Hristo Iliev.
Some tests of the controversial topic suggested by FreeNickName. (Sorry, I had to test it on Ubuntu, because the Windows machine was one of my friends'.)
--1-- Changed from delete to delete [] (but without memset) (-std=c++11 -fopenmp):
from pthread: 13.491405
from omp: 13.023099
from serial: 20.665132
from FreeNickName: 12.022501
--2-- With memset immediately after new (-std=c++11 -fopenmp):
from pthread: 13.996505
from omp: 13.192444
from serial: 19.882127
from FreeNickName: 12.541723
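For reference, "memset immediately after new" refers to touching the whole allocation right after allocating it, roughly as in this minimal sketch (the exact placement in the updated test code is my assumption):

#include <cstring>

const int sizen = 256000000;

int main() {
    double * sinTable = new double[sizen];
    // touch every page of the allocation up front so the timed loops
    // below do not also pay for first-touch page faults
    std::memset(sinTable, 0, sizen * sizeof(double));
    // ... timed loops as in the original code ...
    delete [] sinTable;
    return 0;
}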
--3-- With memset immediately after new (-std=c++11 -fopenmp -march=native -O2):
from pthread: 11.886978
from omp: 11.351801
from serial: 17.002865
from FreeNickName: 11.198779
--4-- With memset immediately after new, and with FreeNickName's version moved before the OMP for version (-std=c++11 -fopenmp -march=native -O2):
from pthread: 11.831127
from FreeNickName: 11.571595
from omp: 11.932814
from serial: 16.976979
--5-- With memset immediately after new, with FreeNickName's version moved before the OMP for version, and with NUM_THREADS set to 5 instead of 2 (I'm on a dual core):
from pthread: 9.451775
from FreeNickName: 9.385366
from omp: 11.854656
from serial: 16.960101
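(Note that NUM_THREADS in the posted code is only used by the pthread version; the OpenMP loop keeps the implementation's default thread count unless it is set explicitly, e.g. via the OMP_NUM_THREADS environment variable, omp_set_num_threads(), or a num_threads clause. A small sketch, reusing sinTable and sizen from the posted code:)

#pragma omp parallel for num_threads(2)
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);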
Recommended Answer
There is nothing wrong with OpenMP in your case. What is wrong is the way you measure the elapsed time.
Using clock() to measure the performance of multithreaded applications on Linux (and most other Unix-like OSes) is a mistake, since it does not return the wall-clock (real) time but the CPU time accumulated over all process threads (and on some Unix flavours even the accumulated CPU time of all child processes). Your parallel code shows better performance on Windows because there clock() returns the real time and not the accumulated CPU time.
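To make the difference concrete, here is a minimal standalone sketch (not from the original post) that times the same kind of parallel loop with both clock() and omp_get_wtime(); on Linux the clock() figure comes out roughly equal to the wall time multiplied by the number of active threads:

#include <cmath>
#include <cstdio>
#include <ctime>
#include <omp.h>

int main() {
    const int sizen = 64000000;
    double * sinTable = new double[sizen];

    clock_t c0 = clock();           // accumulated CPU time of all threads
    double  w0 = omp_get_wtime();   // wall-clock time

    #pragma omp parallel for
    for(int n = 0; n < sizen; ++n)
        sinTable[n] = std::sin(2 * M_PI * n / sizen);

    clock_t c1 = clock();
    double  w1 = omp_get_wtime();

    printf("cpu  time: %lf\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("wall time: %lf\n", w1 - w0);

    delete [] sinTable;
    return 0;
}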
The best way to prevent such discrepancies is to use the portable OpenMP timer routine omp_get_wtime():
double start = omp_get_wtime();
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
double finish = omp_get_wtime();
printf("from omp: %lf\n", finish - start);
For non-OpenMP applications, you should use clock_gettime() with the CLOCK_REALTIME clock:
struct timespec start, finish;
clock_gettime(CLOCK_REALTIME, &start);
#pragma omp parallel for
for(int n = 0; n < sizen; ++n)
    sinTable[n] = std::sin(2 * M_PI * n / sizen);
clock_gettime(CLOCK_REALTIME, &finish);
printf("from omp: %lf\n", (finish.tv_sec + 1.e-9 * finish.tv_nsec) -
                          (start.tv_sec + 1.e-9 * start.tv_nsec));
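Note that on older glibc versions, such as the one shipped with Ubuntu 12.04, clock_gettime() lives in librt, so the link line may also need -lrt, e.g.:

g++ -std=c++11 ./eg001.cpp -fopenmp -lrt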