Question
My C++ code uses SSE, and I now want to extend it to use AVX when it is available. I detect AVX support at run time and call a function that uses AVX instructions. My setup is Win7 SP1 + VS2010 SP1 on a CPU with AVX.
To use AVX, you need to include this header:
#include "immintrin.h"
You can then use AVX intrinsics such as _mm256_mul_ps, _mm256_add_ps, etc. The problem is that, by default, VS2010 produces code that runs very slowly and shows this warning:
warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX
It seems that VS2010 does not actually use AVX instructions but emulates them instead. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere it can, so my code may crash on a CPU that does not support AVX!
So the question is: how do I make the VS2010 compiler produce AVX code, but only where I use AVX intrinsics directly? This works for SSE: I just use SSE intrinsics and the compiler produces SSE code without any compiler option such as /arch:SSE. For AVX it does not work, for some reason.
Recommended answer
2021 update: Modern versions of MSVC don't need manual use of _mm256_zeroupper(), even when compiling AVX intrinsics without /arch:AVX. VS2010 did.
The behavior you are seeing is the result of expensive state switching.
See page 102 of Agner Fog's manual:
http://www.agner.org/optimize/microarchitecture.pdf
Every time you improperly switch back and forth between SSE and AVX instructions, you pay an extremely high penalty of roughly 70 cycles.
When you compile without /arch:AVX, VS2010 generates SSE instructions but still uses AVX wherever you have AVX intrinsics. You therefore get code that contains both SSE and AVX instructions, and that code incurs those state-switching penalties. (VS2010 knows this, which is why it emits the warning you're seeing.)
Therefore, you should use either all SSE or all AVX. Specifying /arch:AVX tells the compiler to use AVX throughout.
It sounds like you're trying to make multiple code paths: one for SSE and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and write a dispatcher that chooses between them based on the hardware the program is running on.
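As a rough illustration of such a dispatcher, here is a minimal sketch. The names mul_add_sse, mul_add_avx, cpu_has_avx, select_mul_add and mul_add are hypothetical and not part of the original question. The two kernels are plain scalar stand-ins so that the sketch fits in one file; in a real project each would live in its own .cpp file, with only the AVX one compiled with /arch:AVX. The detection assumes the usual CPUID leaf 1 bits (OSXSAVE and AVX) plus an XCR0 check on MSVC, and falls back to __builtin_cpu_supports on GCC/Clang.

```cpp
#include <cstddef>
#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv (VS2010 SP1 and later)
#endif

// Signature shared by both code paths: out[i] = a[i] * b[i] + c[i].
typedef void (*MulAddFn)(const float*, const float*, const float*,
                         float*, std::size_t);

// Stand-in for the SSE path. Assumption: the real version would sit in a
// separate .cpp file compiled WITHOUT /arch:AVX and use SSE intrinsics.
void mul_add_sse(const float* a, const float* b, const float* c,
                 float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}

// Stand-in for the AVX path. Assumption: the real version would sit in a
// separate .cpp file compiled WITH /arch:AVX and use _mm256_* intrinsics.
void mul_add_avx(const float* a, const float* b, const float* c,
                 float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}

// True only if both the CPU and the operating system support AVX.
bool cpu_has_avx() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE
    const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU has AVX
    if (!osxsave || !avx) return false;
    // XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
    return (_xgetbv(0) & 0x6) == 0x6;
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;  // unknown platform: stay on the baseline path
#endif
}

// Pick the implementation that matches the running hardware.
MulAddFn select_mul_add() {
    return cpu_has_avx() ? mul_add_avx : mul_add_sse;
}

// Public entry point: resolves the implementation on first use.
void mul_add(const float* a, const float* b, const float* c,
             float* out, std::size_t n) {
    static const MulAddFn fn = select_mul_add();
    fn(a, b, c, out, n);
}
```

Callers only ever see mul_add, so no AVX instruction is executed unless the check succeeds. Note that the dispatcher itself must be compiled without /arch:AVX, otherwise the compiler may emit AVX instructions before the check has run.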
If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
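A minimal sketch of what "appropriately" might look like: clear the upper halves of the YMM registers after the last 256-bit instruction and before any code that the compiler may emit as legacy SSE. The function name mul_add_avx_then_sse and the macro HYPOTHETICAL_AVX_TARGET are made up for this illustration. The macro is only there so that GCC/Clang accept AVX intrinsics without -mavx; with VS2010 it expands to nothing.

```cpp
#include <cstddef>
#include <immintrin.h>

// Assumption: GCC/Clang need a per-function target attribute to accept AVX
// intrinsics when the file is not built with -mavx. MSVC accepts them as-is.
#if defined(__GNUC__)
#define HYPOTHETICAL_AVX_TARGET __attribute__((target("avx")))
#else
#define HYPOTHETICAL_AVX_TARGET
#endif

// Computes out[i] = a[i] * b[i] + c[i] for i in [0, n).
// Must only be called on a CPU (and OS) that supports AVX.
HYPOTHETICAL_AVX_TARGET
void mul_add_avx_then_sse(const float* a, const float* b, const float* c,
                          float* out, std::size_t n) {
    std::size_t i = 0;

    // AVX part: process 8 floats per iteration with 256-bit registers.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        const __m256 vc = _mm256_loadu_ps(c + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(va, vb), vc));
    }

    // Leave the "upper halves in use" state before any non-AVX code runs.
    _mm256_zeroupper();

    // Scalar tail: without /arch:AVX the compiler may emit legacy SSE
    // instructions here, which is why the zeroupper comes first.
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

The same rule applies at the boundary back to the caller: if the calling code is SSE, the AVX function should execute _mm256_zeroupper() before it returns.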