Question
I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
#include <atomic>

std::atomic<int> y(0);

void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}
Every compiler I've tried issues the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no such optimization would be applicable. What's preventing it in my case?
Here's the code in Compiler Explorer.
Recommended answer
The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a race bug, but only the garden-variety kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables.) A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
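As a rough sketch of what that reasoning permits, the two functions below would be indistinguishable to any conforming observer. The function names are made up for illustration, and the second function is a hypothetical compiler output, not something current compilers actually produce.

```cpp
#include <atomic>

std::atomic<int> y(0);

// The source as written: three relaxed stores of different values.
void store_three_times() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(2, order);
    y.store(3, order);
}

// Hypothetical coalesced form: only the final value is stored.
// An observer can never prove that 1 and 2 were skipped, because
// "no other thread ran between the stores" is always a legal execution.
void store_coalesced() {
    y.store(3, std::memory_order_relaxed);
}
```

Either way, a thread that later loads y can only rely on eventually seeing 3.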
It doesn't depend on the target architecture or hardware, just as compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
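A minimal sketch of the progress-bar scenario, assuming a worker that publishes a percentage through a relaxed atomic which some UI thread polls. The names progress, process and process_coalesced are hypothetical, and the second function only illustrates what a coalescing compiler would be allowed to emit.

```cpp
#include <atomic>
#include <cstddef>

std::atomic<int> progress(0);  // percentage polled by a (hypothetical) UI thread

// Source as written: publish progress on every iteration.
long process(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];  // stand-in for real work; no other atomics in the loop
        progress.store(static_cast<int>((i + 1) * 100 / n),
                       std::memory_order_relaxed);
    }
    return sum;
}

// What sinking and folding the stores would amount to: the UI thread
// sees 0 for the whole run and then 100 at the very end.
long process_coalesced(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    if (n > 0) {
        progress.store(100, std::memory_order_relaxed);
    }
    return sum;
}
```

Both versions return the same sum and leave progress at 100, which is why the transformation is legal under the as-if rule even though the user-visible behaviour is much worse.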
There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) I.e. coalescing would violate the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
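To make the shared_ptr case concrete, here is a small sketch (all function names are hypothetical). Copying a std::shared_ptr on every iteration bumps and drops the reference count each time, and since compilers don't currently elide those paired operations, the usual manual workaround is to pass by const reference.

```cpp
#include <memory>

// Takes the pointer by value: every call copies it, which means one
// reference-count increment on entry and one decrement on exit.
int read_by_value(std::shared_ptr<int> p) { return *p; }

// Takes the pointer by const reference: no reference-count traffic.
int read_by_ref(const std::shared_ptr<int>& p) { return *p; }

int sum_by_value(const std::shared_ptr<int>& p, int iterations) {
    int total = 0;
    for (int i = 0; i < iterations; ++i) {
        total += read_by_value(p);  // redundant inc/dec pair each iteration
    }
    return total;
}

int sum_by_ref(const std::shared_ptr<int>& p, int iterations) {
    int total = 0;
    for (int i = 0; i < iterations; ++i) {
        total += read_by_ref(p);    // count is never touched
    }
    return total;
}
```

A compiler that coalesced atomics could in principle turn the first loop into the second, but today that rewrite has to be done by hand.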
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
- http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
- http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also the discussion of this same topic under Richard Hodges' answer to "Can num++ be atomic for 'int num'?" (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different.) See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
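As a sketch of that workaround applied to the code from the question (the names vy and store_three_times_volatile are made up here): std::atomic provides volatile-qualified overloads of store and load, so the only change is the declaration.

```cpp
#include <atomic>

// Same variable as in the question, but volatile-qualified.
volatile std::atomic<int> vy(0);

void store_three_times_volatile() {
    auto order = std::memory_order_relaxed;
    // Each store is an access to a volatile object, so the compiler
    // must perform all three even though the value never changes.
    vy.store(1, order);
    vy.store(1, order);
    vy.store(1, order);
}
```

The memory ordering semantics are unchanged; volatile only adds the requirement that each access actually be performed.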
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if(x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);  // release a lock before a long-running loop
    for() {...}  // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch.) Especially if the store is only relaxed or release instead of seq_cst.
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). That isn't a sufficient explanation on its own, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentions.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).