問題描述
給定 std::bitset<64>bits
具有任意數(shù)量的位設(shè)置和位位置 X
(0-63)
Given std::bitset<64> bits
with any number of bits set and a bit position X
(0-63)
在 X 位置或更低位置計(jì)數(shù)或在 X 位置未設(shè)置時(shí)返回 0 的最有效方法是什么
What is the most efficient way to count bits at position X or lower or return 0 if the bit at X is not set
注意:如果設(shè)置了該位,則返回值始終至少為 1
蠻力方式很慢:
int countupto(std::bitset<64> bits, int X)
{
if (!bits[X]) return 0;
int total=1;
for (int i=0; i < X; ++i)
{
total+=bits[i];
}
return total;
}
bitset
的 count()
方法會(huì)給你所有比特的 popcount
,但是 bitset
不支持范圍
The count()
methof of bitset
will give you the popcount
of all the bits, but bitset
does not support ranges
注意:這不是 如何計(jì)算 32 位整數(shù)中設(shè)置的位數(shù)? 因?yàn)樗儐柫瞬皇欠秶?0 到 X 的所有位
Note: This is not a dup of How to count the number of set bits in a 32-bit integer? as that asks about all bits not the range 0 through X
推薦答案
這個(gè) C++ 讓 g++ 發(fā)出 非常好的 x86 ASM(godbolt 編譯器資源管理器).我希望它也能在其他 64 位架構(gòu)上有效地編譯(如果有用于 std::bitset::count
的 HW popcount,否則這將始終是緩慢的部分;例如,一定要使用g++ -march=nehalem
或更高,或者 -mpopcnt
如果您不想啟用任何其他功能,如果您可以將代碼限制為僅在支持該功能的 CPU 上運(yùn)行x86 指令):
This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count
to use, otherwise that'll always be the slow part; e.g. sure to use g++ -march=nehalem
or higher, or -mpopcnt
if you don't want to enable anything else, if you can limit your code to only running on CPUs that support that x86 instruction):
#include <bitset>
int popcount_subset(std::bitset<64> A, int pos) {
int high_bits_to_eliminate = 63 - pos;
A <<= (high_bits_to_eliminate & 63); // puts A[pos] at A[63].
return (A[63]? ~0ULL : 0) & A.count(); // most efficient way: great code with gcc and clang
// see the godbolt link for some #ifdefs with other ways to do the check, like
// return A[BSET_SIZE-1] ? A.count() : 0;
}
這在 32 位架構(gòu)上可能不是最佳選擇,因此如果您需要進(jìn)行 32 位構(gòu)建,請比較其他替代方案.
This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.
這將適用于其他大小的位集,只要您對硬編碼的 63
進(jìn)行一些處理,并更改 &63
掩碼用于將移位計(jì)數(shù)轉(zhuǎn)換為更一般的范圍檢查.為了使用奇怪大小的位集獲得最佳性能,請制作一個(gè)模板函數(shù),專門針對目標(biāo)機(jī)器的 size <= register width
.在這種情況下,將 bitset 提取為適當(dāng)寬度的 unsigned
類型,并移到寄存器的頂部而不是 bitset 的頂部.
This will work for other sizes of bitset, as long as you do something about the hard-coded 63
s, and change the & 63
mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width
of the target machine. In that case, extract the bitset to an unsigned
type of the appropriate width, and shift to the top of the register instead of the top of the bitset.
您希望這也為 bitset<32>
生成理想的代碼,但事實(shí)并非如此.gcc/clang 在 x86-64 上仍然使用 64 位寄存器.
You'd expect this to also generate ideal code for bitset<32>
, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.
對于大的bitsets,移動(dòng)整個(gè)事物會(huì)比僅僅popcounting包含pos
的詞下面的詞慢,然后在那個(gè)詞上使用它.(如果您可以假設(shè) SSSE3 而不是 popcnt
insn 硬件支持,或者對于 32 位目標(biāo),那么矢量化 popcount 真正在 x86 上大放異彩.AVX2 256 位 pshufb
是最快的方式進(jìn)行批量 popcounts,但沒有 AVX2 我認(rèn)為 64 位 popcnt
非常接近 128 位 pshufb
實(shí)現(xiàn).更多討論請參閱評論.)
For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos
, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt
insn hardware support, or for 32bit targets. AVX2 256bit pshufb
is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt
is pretty close to a 128-bit pshufb
implementation. See the comments for more discussion.)
如果您有一個(gè) 64 位元素的數(shù)組,并且想要分別計(jì)算每個(gè)元素中某個(gè)位置以下的位數(shù),那么您絕對應(yīng)該使用 SIMD.該算法的移位部分矢量化,而不僅僅是 popcnt 部分.在基于 pshufb
的 popcnt 之后,使用 psadbw
對全零寄存器對 64 位塊中的字節(jié)進(jìn)行水平求和,該 popcnt 分別為每個(gè)字節(jié)中的位生成計(jì)數(shù).SSE/AVX 沒有 64 位算術(shù)右移,但您可以使用不同的技術(shù)來混合每個(gè)元素的高位.
If you have an array of 64-bit elements, and want to count bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw
against an all-zero register to horizontal-sum bytes in 64-bit chunks after a pshufb
-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shift, but you can use a different technique to blend on the high bit of each element.
你想讓編譯器輸出的 asm 指令將:
The asm instructions you want to get the compiler to output will:
- 從 64 位值中刪除不需要的位
- 測試所需位的最高位.
- 算了吧.
- 返回 0 或 popcount,具體取決于測試結(jié)果.(無分支或分支實(shí)現(xiàn)都有優(yōu)點(diǎn).如果分支是可預(yù)測的,無分支實(shí)現(xiàn)往往會(huì)更慢.)
<小時(shí)>
1 的明顯方法是生成掩碼 ((1<<(pos+1)) -1
) 和 &代碼>它.一種更有效的方法是左移
63-pos
,將您想要打包的位留在寄存器的頂部.
The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1
) and &
it. A more efficient way is to left-shift by 63-pos
, leaving the bits you want packed at the top of a register.
這也有一個(gè)有趣的副作用,就是把你想測試的位作為寄存器的最高位.測試符號(hào)位,而不是任何其他任意位,需要的指令稍微少一些.算術(shù)右移可以將符號(hào)位廣播到寄存器的其余部分,從而實(shí)現(xiàn)比通常更高效的無分支代碼.
This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.
執(zhí)行popcount 是一個(gè)討論較多的問題,但實(shí)際上是難題中更棘手的部分.在 x86 上,有非常高效的硬件支持,但僅限于最新的硬件.在 Intel CPU 上,popcnt
指令僅適用于 Nehalem 和更新版本.我忘記了 AMD 何時(shí)增加了支持.
Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt
instruction is only available on Nehalem and newer. I forget when AMD added support.
因此,為了安全地使用它,您需要使用不使用 popcnt
的回退進(jìn)行 CPU 調(diào)度.或者,制作依賴/不依賴某些 CPU 功能的單獨(dú)二進(jìn)制文件.
So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt
. Or, make separate binaries that do/don't depend on some CPU features.
popcount 可以通過幾種方式完成.一種使用 SSSE3 pshufb
來實(shí)現(xiàn) 4 位 LUT.不過,這在用于整個(gè)陣列時(shí)最有效,而不是一次使用單個(gè) 64b.標(biāo)量比特黑客在這里可能是最好的,并且不需要 SSSE3(因此將與具有 64 位但不具有 pshufb 的古老 AMD CPU 兼容.)
popcount without the popcnt
instruction can be done a few ways. One uses SSSE3 pshufb
to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)
(A[63]? ~0ULL : 0)
要求編譯器將高位廣播到所有其他位位置,允許將其用作 AND 掩碼為零(或不為) popcount 結(jié)果.請注意,即使對于大的 bitset 大小,它仍然只屏蔽 popcnt
的輸出,而不是 bitset 本身,所以 ~0ULL
很好,我使用 ULL 來確保不是曾經(jīng)要求編譯器僅將位廣播到寄存器的低 32b(例如,在 Windows 上使用 UL
).
(A[63]? ~0ULL : 0)
asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt
, not the bitset itself, so ~0ULL
is fine I used ULL to make sure wasn't ever asking the compiler to broadcast the bit only to the low 32b of a register (with UL
on Windows, for example).
這個(gè)廣播可以通過算術(shù)右移 63 來完成,這會(huì)移動(dòng)高位的副本.
This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.
clang 從原始版本生成此代碼.在 Glenn 就 4 的不同實(shí)現(xiàn)提出一些建議后,我意識(shí)到我可以通過編寫更像我想要的 ASM 的源代碼來引導(dǎo) gcc 走向 clang 的最佳解決方案.明顯的 ((int64_t)something) >>63
更直接地請求算術(shù)右移不會(huì)嚴(yán)格可移植,因?yàn)閹Х?hào)的右移是 實(shí)現(xiàn)定義為算術(shù)或邏輯.該標(biāo)準(zhǔn)不提供任何可移植的算術(shù)右移運(yùn)算符.(這不是未定義行為,盡管如此.)無論如何,幸運(yùn)的是編譯器足夠聰明:一旦你給它足夠的提示,gcc 就會(huì)看到最好的方法.
clang generated this code from the original version. After some prodding from Glenn about different implementations for 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63
to more directly request an arithmetic right shift would not be strictly portable, because signed right-shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.
這個(gè)源代碼在 x86-64 和 ARM64 上使用 gcc 和 clang 編寫了很棒的代碼.兩者都只是在 popcnt 的輸入上使用算術(shù)右移(因此移位可以與 popcnt 并行運(yùn)行).它還可以在帶有 gcc 的 32 位 x86 上很好地編譯,因?yàn)槠帘沃话l(fā)生在 32 位變量上(在添加了多個(gè) popcnt 結(jié)果之后).其余的函數(shù)在 32 位上很糟糕(當(dāng)位集大于寄存器時(shí)).
This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).
帶有 gcc 的原始三元運(yùn)算符版本
使用 gcc 5.3.0 編譯 -O3 -march=nehalem -mtune=haswell
(較舊的 gcc,如 4.9.2,也仍然發(fā)出這個(gè)):
Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell
(older gcc, like 4.9.2, also still emit this):
; the original ternary-operator version. See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
; input bitset in rdi, input count in esi (SysV ABI)
mov ecx, esi ; x86 variable-count shift requires the count in cl
xor edx, edx ; edx=0
xor eax, eax ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
not ecx ; two's complement bithack for 63-pos (in the low bits of the register)
sal rdi, cl ; rdi << ((63-pos) & 63); same insn as shl (arithmetic == logical left shift)
popcnt rdx, rdi
test rdi, rdi ; sets SF if the high bit is set.
cmovs rax, rdx ; conditional-move on the sign flag
ret
見如何證明 C 語句 -x、~x+1 和 ~(x-1) 產(chǎn)生相同的結(jié)果? gcc 使用 -x == ~ 的背景x + 1
補(bǔ)碼恒等式.(還有 哪個(gè) 2 的補(bǔ)整數(shù)如果只需要結(jié)果的低部分,則可以在不將輸入中的高位歸零的情況下使用操作? 切切地提到 shl
掩蓋了移位計(jì)數(shù),所以我們只需要低位6 位 ecx
保存 63 - pos
.主要是鏈接它,因?yàn)槲易罱鼘懥怂魏稳栽陂喿x這一段的人可能會(huì)覺得它很有趣.)
See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1
two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl
masks the shift count, so we only need the low 6 bits of ecx
to hold 63 - pos
. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)
內(nèi)聯(lián)時(shí),其中一些指令將消失.(例如,gcc 會(huì)首先在 ecx 中生成計(jì)數(shù).)
Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)
使用 Glenn 的乘法而不是三元運(yùn)算符 的想法(由 USE_mul
啟用),gcc 做到了
With Glenn's multiply instead of ternary operator idea (enabled by USE_mul
), gcc does
shr rdi, 63
imul eax, edi
在末尾而不是 xor
/test
/cmovs
.
mov r,r
:1 個(gè)融合域 uop,0 延遲,無執(zhí)行單元xor
-zeroing:1 個(gè)融合域 uop,無執(zhí)行單元not
:p0/p1/p5/p6 1 uop,1c 延遲,每 0.25c 吞吐量 1 個(gè)shl
(又名sal
)在cl
中有計(jì)數(shù):p0/p6 為 3 uops:2c 延遲,每 2c 吞吐量 1 個(gè).(奇怪的是,Agner Fog 的數(shù)據(jù)表明 IvyBridge 只需要 2 個(gè) uops.)popcnt
:p1 1 uop,3c 延遲,每 1c 吞吐量 1 個(gè)shr?? r,imm
:p0/p6 1 uop,1c 延遲.每 0.5c 吞吐量 1 個(gè).imul r,r
:p1、3c 延遲為 1uop.- 不算
ret
mov r,r
: 1 fused-domain uop, 0 latency, no execution unitxor
-zeroing: 1 fused-domain uop, no execution unitnot
: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughputshl
(akasal
) with count incl
: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)popcnt
: 1 uop for p1, 3c latency, 1 per 1c throughputshr r,imm
: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.imul r,r
: 1uop for p1, 3c latency.- not counting the
ret
總計(jì):
- 9 個(gè)融合域 uop,可以在 2.25 個(gè)周期內(nèi)發(fā)出(理論上;uop 緩存行效應(yīng)通常會(huì)略微限制前端).
- p0/p6 的 4 個(gè) uops(班次).p1 2 uop.1 個(gè)任意 ALU 端口 uop.可以每 2c 執(zhí)行一個(gè)(使移位端口飽和),因此前端是最嚴(yán)重的瓶頸.
- 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
- 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.
延遲:從比特集準(zhǔn)備好到結(jié)果為:shl
(2) -> popcnt
(3) -> imul代碼>(3).總共8 個(gè)周期.或者當(dāng)
pos
準(zhǔn)備好時(shí) 9c,因?yàn)?not
是它的額外 1c 延遲.
Latency: Critical path from when the bitset is ready to when the result is: shl
(2) -> popcnt
(3) -> imul
(3). Total 8 cycles. Or 9c from when pos
is ready, because the not
is an extra 1c latency for it.
最佳bitbroadcast
版本 將shr??
替換為sar
(相同的性能)和imul
帶有 和
(1c 延遲而不是 3c,在任何端口上運(yùn)行).因此,唯一的性能變化是將關(guān)鍵路徑延遲減少到 6 個(gè)周期.前端的吞吐量仍然是瓶頸.and
能夠在任何端口上運(yùn)行并沒有什么區(qū)別,除非您將其與在 port1 上出現(xiàn)瓶頸的代碼混合在一起(而不是查看僅運(yùn)行this 緊密循環(huán)的代碼).
The optimal bitbroadcast
version replaces shr
with sar
(same perf), and imul
with and
(1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and
being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).
cmov(三元運(yùn)算符)版本:11 個(gè)融合域 uops(前端:每 2.75c 一個(gè)).執(zhí)行單元:仍然在每個(gè) 2c 一個(gè)的移位端口 (p0/p6) 上遇到瓶頸.延遲:從 bitset 到 result 為 7c,從 pos 到 result 為 8c.(cmov
是 2c 延遲,對于 p0/p1/p5/p6 中的任何一個(gè)都是 2 uops.)
cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov
is 2c latency, 2 uops for any of p0/p1/p5/p6.)
Clang 有一些不同的技巧:代替 test
/cmovs
,它生成一個(gè)全一或全的掩碼-zeros 通過使用算術(shù)右移將符號(hào)位廣播到寄存器的所有位置.我喜歡它:在英特爾上使用 and
而不是 cmov
更有效.不過,它仍然具有數(shù)據(jù)依賴性,并為分支的兩端工作(這是 cmov 的主要缺點(diǎn)).更新:如果源代碼正確,gcc 也會(huì)使用這種方法.
Clang has some different tricks up its sleeve: Instead of test
/cmovs
, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and
instead of cmov
is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.
clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell
popcount_subset(std::bitset<64ul>, int):
mov ecx, 63
sub ecx, esi ; larger code size, but faster on CPUs without mov-elimination
shl rdi, cl ; rdi << ((63-pos) & 63)
popcnt rax, rdi ; doesn't start a fresh dep chain before this, like gcc does
sar rdi, 63 ; broadcast the sign bit
and eax, edi ; eax = 0 or its previous value
ret
sar/and
替換了 xor/test/cmov
,并且 cmov
是 Intel CPU 上的 2-uop 指令,所以這真的很好.(對于三元運(yùn)算符版本).
sar / and
replaces xor / test / cmov
, and cmov
is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version).
在使用乘法源版本或bitbroadcast"源版本時(shí),
Clang 仍然使用 sar/和
技巧,而不是實(shí)際的 imul
.所以那些幫助 gcc 而不傷害 clang.(sar/and
肯定比 shr??/imul
好:關(guān)鍵路徑上的延遲減少了 2c.)pow_of_two_sub
版本確實(shí)傷害了 clang(參見第一個(gè)godbolt鏈接:從這個(gè)答案中省略,以避免與沒有成功的想法混淆).
Clang still does the sar / and
trick instead of an actual imul
when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and
is definitely better than shr/imul
: 2c less latency on the critical path.) The pow_of_two_sub
version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).
mov ecx, 63
/sub ecx, esi
實(shí)際上在 CPU 上更快,無需消除 reg,reg 移動(dòng)(零延遲和無執(zhí)行端口,由寄存器重命名處理).這包括 Intel IvyBridge 之前的版本,但不包括更新的 Intel 和 AMD CPU.
The mov ecx, 63
/ sub ecx, esi
is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.
Clang 的 mov imm
/sub
方法只將 pos
的一個(gè)周期延遲放到關(guān)鍵路徑上(超出 bitset->result 延遲),而不是 mov ecx, esi
/not ecx
在 mov r,r
具有 1c 延遲的 CPU 上的兩個(gè).
Clang's mov imm
/ sub
method puts only one cycle of latency for pos
onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi
/ not ecx
on CPUs where mov r,r
has 1c latency.
使用 BMI2(Haswell 及更高版本),最佳 ASM 版本可以將 mov
保存到 ecx
.其他一切都一樣,因?yàn)?shlx
將其移位??計(jì)數(shù)輸入寄存器屏蔽到操作數(shù)大小,就像 shl
一樣.
With BMI2 (Haswell and later), an optimal ASM version can save a mov
to ecx
. Everything else works the same, because shlx
masks its shift-count input register down to the operand-size, just like shl
.
x86 移位指令具有瘋狂的 CISC 語義,如果移位計(jì)數(shù)為零,則標(biāo)志不受影響.因此,可變計(jì)數(shù)移位指令對標(biāo)志的舊值具有(潛在的)依賴性.普通"x86 shl r, cl
在 Haswell 上解碼為 3 uops,但 BMI2 shlx r, r, r
僅為 1.所以 gcc 仍然發(fā)出 sal
與 -march=haswell
一起使用,而不是使用 shlx
(它確實(shí)在其他一些情況下使用).
x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl
decodes to 3 uops on Haswell, but BMI2 shlx r, r, r
is only 1. So it's too bad that gcc still emits sal
with -march=haswell
, instead of using shlx
(which it does use in some other cases).
// hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
not esi ; The low 6 bits hold 63-pos. gcc's two-s complement trick
xor eax, eax ; break false dependency on Intel. maybe not needed when inlined.
shlx rdi, rdi, rsi ; rdi << ((63-pos) & 63)
popcnt rax, rdi
sar rdi, 63 ; broadcast the sign bit: rdi=0 or -1
and eax, edi ; eax = 0 or its previous value
ret
英特爾 Haswell 的性能分析:6 個(gè)融合域 uops(前端:每 1.5c 一個(gè)).執(zhí)行單元:2 p0/p6 shift uops.1 p1 uop.2 個(gè)任意端口 uops:(從總執(zhí)行端口限制中每 1.25c 一個(gè)).關(guān)鍵路徑延遲:shlx
(1) -> popcnt
(3) -> and
(1) = 5c bitset->result.(或來自 pos
->result 的 6c).
Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops: (one per 1.25c from total execution port limits). Critical path latency: shlx
(1) -> popcnt
(3) -> and
(1) = 5c bitset->result. (or 6c from pos
->result).
請注意,內(nèi)聯(lián)時(shí),人類(或智能編譯器)可以避免對 xor eax, eax
的需要.它只是因?yàn)?popcnt
對輸出寄存器的錯(cuò)誤依賴(在 Intel 上),我們需要 eax
中的輸出(調(diào)用者可能最近使用了很長時(shí)間)鏈).使用 -mtune=bdver2
或其他東西,gcc 不會(huì)將用于 popcnt
輸出的寄存器清零.
Note that when inlining, a human (or smart compiler) could avoid the need for the xor eax, eax
. It's only there because of popcnt
's false dependency on the output register (on Intel), and we need the output in eax
(which the caller may have used recently for a long dep chain). With -mtune=bdver2
or something, gcc won't zero the register it's going to use for popcnt
output.
內(nèi)聯(lián)時(shí),我們可以使用一個(gè)輸出寄存器,該寄存器至少早在 popcnt
的源代碼中就已經(jīng)準(zhǔn)備好了,以避免出現(xiàn)問題.當(dāng)以后不需要源時(shí),編譯器將執(zhí)行就地 popcnt rdi,rdi
,但這里不是這種情況.相反,我們可以選擇另一個(gè)在源之前已經(jīng)準(zhǔn)備好的寄存器.popcnt
的輸入依賴于63-pos
,我們可以破壞它,所以popcnt rsi,rdi
對rsi的依賴不能延遲它.或者如果我們在寄存器中有 63
,我們可以 popcnt rsi,rdi
/sarx rax, rsi, reg_63
/ and eax,esi
.或者 BMI2 3 操作數(shù)移位指令也可以讓我們不破壞輸入,以防以后需要它們.
When inlining, we could use an output register that already has to be ready at least as early as popcnt
's source reg to avoid the problem. Compilers will do an in-place popcnt rdi,rdi
when the source isn't needed later, but that's not the case here. Instead, we can pick another register that already has to be ready before the source. popcnt
's input depends on 63-pos
, and we can clobber it, so popcnt rsi,rdi
's dependency on rsi can't delay it. Or if we had 63
in a register, we could popcnt rsi,rdi
/ sarx rax, rsi, reg_63
/ and eax, esi
. Or BMI2 3-operand shift instructions would also let us not clobber inputs in case they're needed afterwards.
這太輕了,以至于循環(huán)開銷和設(shè)置輸入操作數(shù)/存儲(chǔ)結(jié)果將成為主要因素.(并且 63-pos
可以使用編譯時(shí)常量進(jìn)行優(yōu)化,或者優(yōu)化到變量計(jì)數(shù)的來源.)
This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. (And the 63-pos
can optimize away with a compile-time constant, or into wherever a variable count comes from.)
英特爾編譯器很有趣,它沒有利用 A[63] 是符號(hào)位這一事實(shí).shl
/bt rdi, 63
/jc
.它甚至以一種非常愚蠢的方式設(shè)置了分支.它可以將 eax 歸零,然后根據(jù) shl
設(shè)置的符號(hào)標(biāo)志跳過 popcnt 或不跳過.
The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit. shl
/ bt rdi, 63
/ jc
. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl
.
最佳分支實(shí)現(xiàn),從 Godbolt 上 -O3 -march=corei7
的 ICC13 輸出開始:
An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7
on godbolt:
// hand-tuned, not compiler output
mov ecx, esi ; ICC uses neg/add/mov :/
not ecx
xor eax, eax ; breaks the false dep, or is the return value in the taken-branch case
shl rdi, cl
jns .bit_not_set
popcnt rax, rdi
.bit_not_set:
ret
這幾乎是最優(yōu)的:A[pos] == true
情況有一個(gè)未被采用的分支.不過,與無分支方法相比,它并沒有節(jié)省多少.
That's pretty much optimal: The A[pos] == true
case has one not-taken branch. It doesn't save very much over the branchless method, though.
如果 A[pos] == false
情況更常見:跳過 ret
指令,到 popcnt
/>ret
.(或在內(nèi)聯(lián)之后:跳轉(zhuǎn)到執(zhí)行 popcnt
并跳回的末尾塊).
If the A[pos] == false
case is more common: jump over a ret
instruction, to a popcnt
/ ret
. (Or after inlining: jump to a block at the end that does the popcnt
and jumps back).
這篇關(guān)于在一個(gè)位置或更低位置計(jì)算設(shè)置位的有效方法是什么?的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!