問題描述
我正在實現一個基于此算法的無鎖隊列,它使用一個計數器來解決 ABA 問題.但我不知道如何用 c++11 CAS 實現這個計數器.例如從算法:
I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with c++11 CAS. For example, from the algorithm:
E9: if CAS(&tail.ptr->next, next, <node, next.count+1>)
這是一個原子操作,意思是如果tail.ptr->next
等于next
,讓tail.ptr->next
code> 指向 node
并同時(原子地)使 next.count+1
.但是,使用C++11 CAS,我只能實現:
It is an atomic operation, meaning if tail.ptr->next
is equal to next
, let tail.ptr->next
point to node
and simultaneously (atomically) make next.count+1
. However, using C++11 CAS, I can only implement:
std::atomic_compare_exchange_weak(&tail.ptr->next, next, node);
不能使 next.count+1
同時發生.
推薦答案
要通過一次原子操作同時原子地修改兩個東西,您需要將它們放在相鄰的內存中,例如在兩個成員的結構中.然后你可以使用 std::atomic
讓 gcc 發出 在 x86-64 上鎖定 cmpxchg16b
,例如.
To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct>
to get gcc to emit lock cmpxchg16b
on x86-64, for example.
為此您不需要內聯匯編,為了避免它,需要一點 C++ 語法痛苦.https://gcc.gnu.org/wiki/DontUseInlineAsm.
You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.
不幸的是,對于當前的編譯器,您需要使用 union
來獲得高效的代碼以僅讀取一對中的一個.顯而易見"對結構進行原子加載然后只使用一個成員的方法仍然會導致 lock cmpxchg16b
讀取整個結構,即使我們只需要一個成員.(慢得多,并且弄臟了緩存線,因此讀者與其他讀者競爭).我相信指針的正常 64b 負載仍然可以正確實現 x86 上的獲取內存排序語義(以及原子性),但是當前的編譯器即使對于 std::memory_order_relaxed<也不會進行這種優化/code>,所以我們用聯合來欺騙他們.
Unfortunately, with current compilers, you need to use a union
to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b
to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed
, so we trick them into it with a union.
(提交了 GCC 錯誤 80835 關于此.TODO:同樣適用于如果這是一個有用的想法,請叮一下.)
(submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)
清單:
確保您的編譯器生成有效代碼,以便在只讀情況下僅加載一個成員,而不是一對
lock cmpxchg16b
.例如使用聯合.
確保您的編譯器保證在編寫不同的聯合成員后訪問聯合的一個成員在該實現中具有明確定義的行為.聯合類型雙關在 C99 中是合法的(所以這應該適用于 C11 stdatomic
),但它在 ISO C++11 中是 UB.但是,它在 C++ 的 GNU 方言中是合法的(受 gcc、clang 和 ICC 等支持).
Make sure your compiler guarantees that accessing one member of a union after writing a different union member has well-defined behaviour in that implementation. Union type-punning is legal in in C99 (so this should work well with C11 stdatomic
), but it's UB in ISO C++11. However, it's legal in the GNU dialect of C++ (supported by gcc, clang, and ICC, among others).
確保您的對象是 16B 對齊的,或 32 位指針的 8B 對齊.更一般地說, alignas(2*sizeof(void*))
應該可以工作.未對齊的 lock
ed 指令在 x86 上可能非常,特別是當它們跨越緩存線邊界時.如果對象未對齊,clang3.8 甚至會將其編譯為庫調用.
Make sure your object is 16B-aligned, or 8B-aligned for 32-bit pointers. More generally, alignas(2*sizeof(void*))
should work. Misaligned lock
ed instructions can be very slow on x86, especially if they cross a cache-line boundary. clang3.8 even compiles it to a library call if the object isn't aligned.
使用 -mcx16
編譯 x86-64 版本.cmpxchg16b
不被最早的 x86-64 CPU (AMD K8) 支持,但之后的一切都應該支持.如果沒有 -mcx16
,您會得到一個庫函數調用(可能使用全局鎖).32 位等效項 cmpxchg8b
已經足夠老,現代編譯器假定支持它.(并且可以使用 SSE、MMX 甚至 x87 來執行 64b 原子加載/存儲,因此在讀取一個成員時使用聯合對于獲得良好性能來說不太重要).
Compile with -mcx16
for x86-64 builds. cmpxchg16b
was not supported by the earliest x86-64 CPUs (AMD K8), but should be on everything after that. Without -mcx16
, you get a library function call (which probably uses a global lock). The 32-bit equivalent, cmpxchg8b
, is old enough that modern compilers assume its support. (And can use SSE, MMX, or even x87 to do 64b atomic loads/stores, so using a union is somewhat less important for good performance when reading one member).
確保指針+uintptr_t 原子對象是無鎖的.這對于 x32 和 32 位 ABI(8B 對象)幾乎是有保證的,但對于 16B 對象則不然.例如MSVC 使用 x86-64 的鎖.
Make sure that a pointer+uintptr_t atomic object is lock-free. This is pretty much guaranteed for x32 and 32-bit ABIs (8B object), but not for 16B objects. e.g. MSVC uses a lock for x86-64.
gcc7 及更高版本將調用 libatomic 而不是內聯 lock cmpxchg16b
,并且將從 atomic_is_lock_free
( 的原因包括它太慢了,這不是用戶期望 is_lock_free
的意思),但是至少現在,libatomic 實現仍然在該指令可用的目標上使用 lock cmpxchg16b
.(它甚至可以為只讀原子對象出現段錯誤,所以它真的不理想.更重要的是,讀取器與其他讀取器爭奪對緩存行的獨占訪問.這就是為什么我們要如此麻煩地避免 lockcmpxchg16b
當我們只需要一個 8 字節的一半時,這里的讀取端.)
gcc7 and later will call libatomic instead of inlining lock cmpxchg16b
, and will return false from atomic_is_lock_free
(for reasons including that it's so slow it's not what users expect is_lock_free
to mean), but at least for now the libatomic implementation still uses lock cmpxchg16b
on targets where that instruction is available. (It can even segfault for read-only atomic objects, so it's really not ideal. More importantly, readers contend with other readers for exclusive access to the cache line. That's why we're going to so much trouble to avoid lock cmpxchg16b
for the read side here when we only want one 8-byte half.)
這是一個帶有 CAS 重試循環的代碼示例,它編譯為看起來正確的 asm,并且我認為對于允許聯合類型雙關語的實現,沒有 UB 或其他不安全的 C++.它是用 C 風格編寫的(非成員函數等),但如果你編寫成員函數也一樣.
Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.
請參閱與ASM輸出的來自 Godbolt 編譯器瀏覽器上的 gcc6.3.對于 -m32
,它使用 cmpxchg8b
的方式與 64 位代碼使用 cmpxchg16b
的方式相同.使用 -mx32
(長模式下的 32 位指針),它可以簡單地使用 64 位 cmpxchg
和普通的 64 位整數加載來獲取一個原子中的兩個成員加載.
See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32
, it uses cmpxchg8b
the same way 64-bit code uses cmpxchg16b
. With -mx32
(32-bit pointers in long mode) it can simply use a 64-bit cmpxchg
, and normal 64-bit integer loads to grab both members in one atomic load.
這是可移植的 C++11(聯合類型雙關除外),沒有任何特定于 x86 的內容.但是,它僅在可以 CAS 兩個指針大小的對象的目標上高效.它編譯為對 ARM/ARM64 和 MIPS64 的 __atomic_compare_exchange_16
庫函數的調用,正如您在 Godbolt 上看到的那樣.
This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16
library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.
它不會在 MSVC 上編譯,其中 atomic
大于 counted_ptr_separate
,所以 static_assert
會捕獲它.據推測,MSVC 在原子對象中包含一個鎖成員.
It doesn't compile on MSVC, where atomic<counted_ptr>
is larger than counted_ptr_separate
, so the static_assert
catches it. Presumably MSVC includes a lock member in the atomic object.
#include <atomic>
#include <stdint.h>
using namespace std;
struct node {
// This alignas is essential for clang to use cmpxchg16b instead of a function call
// Apparently just having it on the union member isn't enough.
struct alignas(2*sizeof(node*)) counted_ptr {
node * ptr;
uintptr_t count; // use pointer-sized integers to avoid padding
};
// hack to allow reading just the pointer without lock-cmpxchg16b,
// but still without any C++ data race
struct counted_ptr_separate {
atomic<node *> ptr;
atomic<uintptr_t> count_separate; // var name emphasizes that accessing this way isn't atomic with ptr
};
static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
//static_assert(std::atomic<counted_ptr>{}.is_lock_free());
union { // anonymous union: the members are directly part of struct node
alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
counted_ptr_separate next;
};
// TODO: write member functions to read next.ptr or read/write next_and_count
int data[4];
};
// make sure read-only access is efficient.
node *follow(node *p) { // good asm, just a mov load
return p->next.ptr.load(memory_order_acquire);
}
node *follow_nounion(node *p) { // really bad asm, using cmpxchg16b to load the whole thing
return p->next_and_count.load(memory_order_acquire).ptr;
}
void update_next(node &target, node *desired)
{
// read the old value efficiently to avoid overhead for the no-contention case
// tearing (or stale data from a relaxed load) will just lead to a retry
node::counted_ptr expected = {
target.next.ptr.load(memory_order_relaxed),
target.next.count_separate.load(memory_order_relaxed) };
bool success;
do {
node::counted_ptr newval = { desired, expected.count + 1 };
// x86-64: compiles to cmpxchg16b
success = target.next_and_count.compare_exchange_weak(
expected, newval, memory_order_acq_rel);
// updates exected on failure
} while( !success );
}
clang 4.0 -O3 -mcx16
的 asm 輸出是:
The asm output from clang 4.0 -O3 -mcx16
is:
update_next(node&, node*):
push rbx # cmpxchg16b uses rbx implicitly so it has to be saved/restored
mov rbx, rsi
mov rax, qword ptr [rdi] # load the pointer
mov rdx, qword ptr [rdi + 8] # load the counter
.LBB2_1: # =>This Inner Loop Header: Depth=1
lea rcx, [rdx + 1]
lock
cmpxchg16b xmmword ptr [rdi]
jne .LBB2_1
pop rbx
ret
gcc 做了一些笨重的存儲/重新加載,但基本上是相同的邏輯.
gcc does some clunky store/reloads, but is basically the same logic.
follow(node*)
編譯為 mov rax, [rdi]
/ret
,因此對指針的只讀訪問為由于工會黑客,它應該很便宜.
follow(node*)
compiles to mov rax, [rdi]
/ ret
, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.
它確實依賴于通過一個成員編寫聯合并通過另一個成員讀取它,以便在不使用 lock cmpxchg16b
的情況下高效讀取指針.這保證在 GNU C++(和 ISO C99/C11)中有效,但在 ISO C++ 中無效.許多其他 C++ 編譯器確實保證聯合類型雙關有效,但即使沒有它它也可能仍然有效:我們總是使用 std::atomic
加載,它必須假設值是異步修改的.所以我們應該避免類似別名的問題,即在通過另一個指針(或聯合成員)寫入值后,寄存器中的值仍然被認為是有效的.不過,編譯器認為是獨立的東西的編譯時重新排序可能是一個問題.
It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b
. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic
loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.
在指針 + 計數器的原子 cmpxchg 之后原子地讀取指針仍然應該為您提供 x86 上的獲取/釋放語義,但我認為 ISO C++ 對此沒有任何說明.猜測廣泛的發布存儲(作為 compare_exchange_weak
的一部分將與大多數架構上來自同一地址的更窄的負載同步(就像在 x86 上一樣),但是 AFAIK C++ std::atomic
不保證任何類型雙關.
Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak
will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic
doesn't guarantee anything about type-punning.
與指針 + ABA 計數器無關,但可以用于其他使用聯合以允許訪問更大原子對象的子集的應用程序:不要使用聯合來允許原子存儲僅指向指針或者只是柜臺.至少如果您關心與該對的獲取負載同步,則不會.即使是強序 x86 也可以 重新排序一個狹窄的商店,用更寬的負載完全包含它.一切仍然是原子的,但就內存排序而言,你進入了奇怪的領域.
Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.
在 x86-64 上,原子 16B 加載需要 lock cmpxchg16b
(這是一個完整的內存屏障,防止前面的窄存儲在它之后變得全局可見).但是,如果您將其與 32 位指針(或 32 位數組索引)一起使用,則很容易遇到問題,因為可以使用常規 64b 加載來加載兩半.如果您需要與其他線程同步而不僅僅是原子性,我不知道您會在其他架構上看到什么問題.
On x86-64, an atomic 16B load requires a lock cmpxchg16b
(which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.
要了解有關 std::memory_order 獲取和釋放的更多信息,請參閱 Jeff Preshing 的優秀文章.
這篇關于如何使用 c++11 CAS 實現 ABA 計數器?的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!