pbootcms网站模板|日韩1区2区|织梦模板||网站源码|日韩1区2区|jquery建站特效-html5模板网

在一個(gè)位置或更低位置計(jì)算設(shè)置位的有效方法是

What is the efficient way to count set bits at a position or lower?(在一個(gè)位置或更低位置計(jì)算設(shè)置位的有效方法是什么?)
本文介紹了在一個(gè)位置或更低位置計(jì)算設(shè)置位的有效方法是什么?的處理方法,對大家解決問題具有一定的參考價(jià)值,需要的朋友們下面隨著小編來一起學(xué)習(xí)吧!

問題描述

給定 std::bitset<64>bits 具有任意數(shù)量的位設(shè)置和位位置 X (0-63)

Given std::bitset<64> bits with any number of bits set and a bit position X (0-63)

在 X 位置或更低位置計(jì)數(shù)或在 X 位置未設(shè)置時(shí)返回 0 的最有效方法是什么

What is the most efficient way to count bits at position X or lower or return 0 if the bit at X is not set

注意:如果設(shè)置了該位,則返回值始終至少為 1

蠻力方式很慢:

int countupto(std::bitset<64> bits, int X)
{
  if (!bits[X]) return 0;
  int total=1;
  for (int i=0; i < X; ++i)
  {
    total+=bits[i];
  }
  return total;
}

bitsetcount() 方法會(huì)給你所有比特的 popcount,但是 bitset不支持范圍

The count() methof of bitset will give you the popcount of all the bits, but bitset does not support ranges

注意:這不是 如何計(jì)算 32 位整數(shù)中設(shè)置的位數(shù)? 因?yàn)樗儐柫瞬皇欠秶?0 到 X 的所有位

Note: This is not a dup of How to count the number of set bits in a 32-bit integer? as that asks about all bits not the range 0 through X

推薦答案

這個(gè) C++ 讓 g++ 發(fā)出 非常好的 x86 ASM(godbolt 編譯器資源管理器).我希望它也能在其他 64 位架構(gòu)上有效地編譯(如果有用于 std::bitset::count 的 HW popcount,否則這將始終是緩慢的部分;例如,一定要使用g++ -march=nehalem 或更高,或者 -mpopcnt 如果您不想啟用任何其他功能,如果您可以將代碼限制為僅在支持該功能的 CPU 上運(yùn)行x86 指令):

This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count to use, otherwise that'll always be the slow part; e.g. sure to use g++ -march=nehalem or higher, or -mpopcnt if you don't want to enable anything else, if you can limit your code to only running on CPUs that support that x86 instruction):

#include <bitset>

int popcount_subset(std::bitset<64> A, int pos) {
  int high_bits_to_eliminate = 63 - pos;
  A <<= (high_bits_to_eliminate & 63);  // puts A[pos] at A[63].

  return (A[63]? ~0ULL : 0) & A.count();  // most efficient way: great code with gcc and clang
  // see the godbolt link for some #ifdefs with other ways to do the check, like
    // return A[BSET_SIZE-1] ? A.count() : 0;
}

這在 32 位架構(gòu)上可能不是最佳選擇,因此如果您需要進(jìn)行 32 位構(gòu)建,請比較其他替代方案.

This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.

這將適用于其他大小的位集,只要您對硬編碼的 63 進(jìn)行一些處理,并更改 &63 掩碼用于將移位計(jì)數(shù)轉(zhuǎn)換為更一般的范圍檢查.為了使用奇怪大小的位集獲得最佳性能,請制作一個(gè)模板函數(shù),專門針對目標(biāo)機(jī)器的 size <= register width.在這種情況下,將 bitset 提取為適當(dāng)寬度的 unsigned 類型,并移到寄存器的頂部而不是 bitset 的頂部.

This will work for other sizes of bitset, as long as you do something about the hard-coded 63s, and change the & 63 mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width of the target machine. In that case, extract the bitset to an unsigned type of the appropriate width, and shift to the top of the register instead of the top of the bitset.

您希望這也為 bitset<32> 生成理想的代碼,但事實(shí)并非如此.gcc/clang 在 x86-64 上仍然使用 64 位寄存器.

You'd expect this to also generate ideal code for bitset<32>, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.

對于大的bitsets,移動(dòng)整個(gè)事物會(huì)比僅僅popcounting包含pos的詞下面的詞慢,然后在那個(gè)詞上使用它.(如果您可以假設(shè) SSSE3 而不是 popcnt insn 硬件支持,或者對于 32 位目標(biāo),那么矢量化 popcount 真正在 x86 上大放異彩.AVX2 256 位 pshufb 是最快的方式進(jìn)行批量 popcounts,但沒有 AVX2 我認(rèn)為 64 位 popcnt 非常接近 128 位 pshufb 實(shí)現(xiàn).更多討論請參閱評論.)

For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt insn hardware support, or for 32bit targets. AVX2 256bit pshufb is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt is pretty close to a 128-bit pshufb implementation. See the comments for more discussion.)

如果您有一個(gè) 64 位元素的數(shù)組,并且想要分別計(jì)算每個(gè)元素中某個(gè)位置以下的位數(shù),那么您絕對應(yīng)該使用 SIMD.該算法的移位部分矢量化,而不僅僅是 popcnt 部分.在基于 pshufb 的 popcnt 之后,使用 psadbw 對全零寄存器對 64 位塊中的字節(jié)進(jìn)行水平求和,該 popcnt 分別為每個(gè)字節(jié)中的位生成計(jì)數(shù).SSE/AVX 沒有 64 位算術(shù)右移,但您可以使用不同的技術(shù)來混合每個(gè)元素的高位.

If you have an array of 64-bit elements, and want to count bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw against an all-zero register to horizontal-sum bytes in 64-bit chunks after a pshufb-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shift, but you can use a different technique to blend on the high bit of each element.

你想讓編譯器輸出的 asm 指令將:

The asm instructions you want to get the compiler to output will:

  1. 從 64 位值中刪除不需要的位
  2. 測試所需位的最高位.
  3. 算了吧.
  4. 返回 0 或 popcount,具體取決于測試結(jié)果.(無分支或分支實(shí)現(xiàn)都有優(yōu)點(diǎn).如果分支是可預(yù)測的,無分支實(shí)現(xiàn)往往會(huì)更慢.)

<小時(shí)>

1 的明顯方法是生成掩碼 ((1<<(pos+1)) -1) 和 &它.一種更有效的方法是左移 63-pos,將您想要打包的位留在寄存器的頂部.


The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1) and & it. A more efficient way is to left-shift by 63-pos, leaving the bits you want packed at the top of a register.

這也有一個(gè)有趣的副作用,就是把你想測試的位作為寄存器的最高位.測試符號(hào)位,而不是任何其他任意位,需要的指令稍微少一些.算術(shù)右移可以將符號(hào)位廣播到寄存器的其余部分,從而實(shí)現(xiàn)比通常更高效的無分支代碼.

This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.

執(zhí)行popcount 是一個(gè)討論較多的問題,但實(shí)際上是難題中更棘手的部分.在 x86 上,有非常高效的硬件支持,但僅限于最新的硬件.在 Intel CPU 上,popcnt 指令僅適用于 Nehalem 和更新版本.我忘記了 AMD 何時(shí)增加了支持.

Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt instruction is only available on Nehalem and newer. I forget when AMD added support.

因此,為了安全地使用它,您需要使用不使用 popcnt 的回退進(jìn)行 CPU 調(diào)度.或者,制作依賴/不依賴某些 CPU 功能的單獨(dú)二進(jìn)制文件.

So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt. Or, make separate binaries that do/don't depend on some CPU features.

popcount 可以通過幾種方式完成.一種使用 SSSE3 pshufb 來實(shí)現(xiàn) 4 位 LUT.不過,這在用于整個(gè)陣列時(shí)最有效,而不是一次使用單個(gè) 64b.標(biāo)量比特黑客在這里可能是最好的,并且不需要 SSSE3(因此將與具有 64 位但不具有 pshufb 的古老 AMD CPU 兼容.)

popcount without the popcnt instruction can be done a few ways. One uses SSSE3 pshufb to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)

(A[63]? ~0ULL : 0) 要求編譯器將高位廣播到所有其他位位置,允許將其用作 AND 掩碼為零(或不為) popcount 結(jié)果.請注意,即使對于大的 bitset 大小,它仍然只屏蔽 popcnt 的輸出,而不是 bitset 本身,所以 ~0ULL 很好,我使用 ULL 來確保不是曾經(jīng)要求編譯器僅將位廣播到寄存器的低 32b(例如,在 Windows 上使用 UL).

(A[63]? ~0ULL : 0) asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt, not the bitset itself, so ~0ULL is fine I used ULL to make sure wasn't ever asking the compiler to broadcast the bit only to the low 32b of a register (with UL on Windows, for example).

這個(gè)廣播可以通過算術(shù)右移 63 來完成,這會(huì)移動(dòng)高位的副本.

This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.

clang 從原始版本生成此代碼.在 Glenn 就 4 的不同實(shí)現(xiàn)提出一些建議后,我意識(shí)到我可以通過編寫更像我想要的 ASM 的源代碼來引導(dǎo) gcc 走向 clang 的最佳解決方案.明顯的 ((int64_t)something) >>63 更直接地請求算術(shù)右移不會(huì)嚴(yán)格可移植,因?yàn)閹Х?hào)的右移是 實(shí)現(xiàn)定義為算術(shù)或邏輯.該標(biāo)準(zhǔn)不提供任何可移植的算術(shù)右移運(yùn)算符.(這不是未定義行為,盡管如此.)無論如何,幸運(yùn)的是編譯器足夠聰明:一旦你給它足夠的提示,gcc 就會(huì)看到最好的方法.

clang generated this code from the original version. After some prodding from Glenn about different implementations for 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63 to more directly request an arithmetic right shift would not be strictly portable, because signed right-shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.

這個(gè)源代碼在 x86-64 和 ARM64 上使用 gcc 和 clang 編寫了很棒的代碼.兩者都只是在 popcnt 的輸入上使用算術(shù)右移(因此移位可以與 popcnt 并行運(yùn)行).它還可以在帶有 gcc 的 32 位 x86 上很好地編譯,因?yàn)槠帘沃话l(fā)生在 32 位變量上(在添加了多個(gè) popcnt 結(jié)果之后).其余的函數(shù)在 32 位上很糟糕(當(dāng)位集大于寄存器時(shí)).

This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).

帶有 gcc 的原始三元運(yùn)算符版本

使用 gcc 5.3.0 編譯 -O3 -march=nehalem -mtune=haswell(較舊的 gcc,如 4.9.2,也仍然發(fā)出這個(gè)):

Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell (older gcc, like 4.9.2, also still emit this):

; the original ternary-operator version.  See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
    ; input bitset in rdi, input count in esi (SysV ABI)
    mov     ecx, esi    ; x86 variable-count shift requires the count in cl
    xor     edx, edx    ; edx=0 
    xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
    not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
    sal     rdi, cl     ; rdi << ((63-pos) & 63);  same insn as shl (arithmetic == logical left shift)
    popcnt  rdx, rdi
    test    rdi, rdi    ; sets SF if the high bit is set.
    cmovs   rax, rdx    ; conditional-move on the sign flag
    ret

見如何證明 C 語句 -x、~x+1 和 ~(x-1) 產(chǎn)生相同的結(jié)果? gcc 使用 -x == ~ 的背景x + 1 補(bǔ)碼恒等式.(還有 哪個(gè) 2 的補(bǔ)整數(shù)如果只需要結(jié)果的低部分,則可以在不將輸入中的高位歸零的情況下使用操作? 切切地提到 shl 掩蓋了移位計(jì)數(shù),所以我們只需要低位6 位 ecx 保存 63 - pos.主要是鏈接它,因?yàn)槲易罱鼘懥怂魏稳栽陂喿x這一段的人可能會(huì)覺得它很有趣.)

See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1 two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl masks the shift count, so we only need the low 6 bits of ecx to hold 63 - pos. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)

內(nèi)聯(lián)時(shí),其中一些指令將消失.(例如,gcc 會(huì)首先在 ecx 中生成計(jì)數(shù).)

Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)

使用 Glenn 的乘法而不是三元運(yùn)算符 的想法(由 USE_mul 啟用),gcc 做到了

With Glenn's multiply instead of ternary operator idea (enabled by USE_mul), gcc does

    shr     rdi, 63
    imul    eax, edi

在末尾而不是 xor/test/cmovs.

  • mov r,r:1 個(gè)融合域 uop,0 延遲,無執(zhí)行單元
  • xor-zeroing:1 個(gè)融合域 uop,無執(zhí)行單元
  • not:p0/p1/p5/p6 1 uop,1c 延遲,每 0.25c 吞吐量 1 個(gè)
  • shl(又名 sal)在 cl 中有計(jì)數(shù):p0/p6 為 3 uops:2c 延遲,每 2c 吞吐量 1 個(gè).(奇怪的是,Agner Fog 的數(shù)據(jù)表明 IvyBridge 只需要 2 個(gè) uops.)
  • popcnt:p1 1 uop,3c 延遲,每 1c 吞吐量 1 個(gè)
  • shr?? r,imm:p0/p6 1 uop,1c 延遲.每 0.5c 吞吐量 1 個(gè).
  • imul r,r:p1、3c 延遲為 1uop.
  • 不算ret
  • mov r,r: 1 fused-domain uop, 0 latency, no execution unit
  • xor-zeroing: 1 fused-domain uop, no execution unit
  • not: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
  • shl (aka sal) with count in cl: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
  • popcnt: 1 uop for p1, 3c latency, 1 per 1c throughput
  • shr r,imm: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
  • imul r,r: 1uop for p1, 3c latency.
  • not counting the ret

總計(jì):

  • 9 個(gè)融合域 uop,可以在 2.25 個(gè)周期內(nèi)發(fā)出(理論上;uop 緩存行效應(yīng)通常會(huì)略微限制前端).
  • p0/p6 的 4 個(gè) uops(班次).p1 2 uop.1 個(gè)任意 ALU 端口 uop.可以每 2c 執(zhí)行一個(gè)(使移位端口飽和),因此前端是最嚴(yán)重的瓶頸.
  • 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
  • 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.

延遲:從比特集準(zhǔn)備好到結(jié)果為:shl(2) -> popcnt(3) -> imul(3).總共8 個(gè)周期.或者當(dāng) pos 準(zhǔn)備好時(shí) 9c,因?yàn)?not 是它的額外 1c 延遲.

Latency: Critical path from when the bitset is ready to when the result is: shl(2) -> popcnt(3) -> imul(3). Total 8 cycles. Or 9c from when pos is ready, because the not is an extra 1c latency for it.

最佳bitbroadcast 版本shr?? 替換為sar(相同的性能)和imul 帶有 (1c 延遲而不是 3c,在任何端口上運(yùn)行).因此,唯一的性能變化是將關(guān)鍵路徑延遲減少到 6 個(gè)周期.前端的吞吐量仍然是瓶頸.and 能夠在任何端口上運(yùn)行并沒有什么區(qū)別,除非您將其與在 port1 上出現(xiàn)瓶頸的代碼混合在一起(而不是查看僅運(yùn)行this 緊密循環(huán)的代碼).

The optimal bitbroadcast version replaces shr with sar (same perf), and imul with and (1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).

cmov(三元運(yùn)算符)版本:11 個(gè)融合域 uops(前端:每 2.75c 一個(gè)).執(zhí)行單元:仍然在每個(gè) 2c 一個(gè)的移位端口 (p0/p6) 上遇到瓶頸.延遲:從 bitset 到 result 為 7c,從 pos 到 result 為 8c.(cmov 是 2c 延遲,對于 p0/p1/p5/p6 中的任何一個(gè)都是 2 uops.)

cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov is 2c latency, 2 uops for any of p0/p1/p5/p6.)

Clang 有一些不同的技巧:代替 test/cmovs,它生成一個(gè)全一或全的掩碼-zeros 通過使用算術(shù)右移將符號(hào)位廣播到寄存器的所有位置.我喜歡它:在英特爾上使用 and 而不是 cmov 更有效.不過,它仍然具有數(shù)據(jù)依賴性,并為分支的兩端工作(這是 cmov 的主要缺點(diǎn)).更新:如果源代碼正確,gcc 也會(huì)使用這種方法.

Clang has some different tricks up its sleeve: Instead of test/cmovs, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and instead of cmov is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.

clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell

popcount_subset(std::bitset<64ul>, int):
    mov     ecx, 63
    sub     ecx, esi      ; larger code size, but faster on CPUs without mov-elimination
    shl     rdi, cl       ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi      ; doesn't start a fresh dep chain before this, like gcc does
    sar     rdi, 63       ; broadcast the sign bit
    and     eax, edi      ; eax = 0 or its previous value
    ret

sar/and 替換了 xor/test/cmov,并且 cmov 是 Intel CPU 上的 2-uop 指令,所以這真的很好.(對于三元運(yùn)算符版本).

sar / and replaces xor / test / cmov, and cmov is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version).

在使用乘法源版本或bitbroadcast"源版本時(shí),

Clang 仍然使用 sar/和 技巧,而不是實(shí)際的 imul.所以那些幫助 gcc 而不傷害 clang.(sar/and 肯定比 shr??/imul 好:關(guān)鍵路徑上的延遲減少了 2c.)pow_of_two_sub 版本確實(shí)傷害了 clang(參見第一個(gè)godbolt鏈接:從這個(gè)答案中省略,以避免與沒有成功的想法混淆).

Clang still does the sar / and trick instead of an actual imul when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and is definitely better than shr/imul: 2c less latency on the critical path.) The pow_of_two_sub version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).

mov ecx, 63/sub ecx, esi 實(shí)際上在 CPU 上更快,無需消除 reg,reg 移動(dòng)(零延遲和無執(zhí)行端口,由寄存器重命名處理).這包括 Intel IvyBridge 之前的版本,但不包括更新的 Intel 和 AMD CPU.

The mov ecx, 63 / sub ecx, esi is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.

Clang 的 mov imm/sub 方法只將 pos 的一個(gè)周期延遲放到關(guān)鍵路徑上(超出 bitset->result 延遲),而不是 mov ecx, esi/not ecxmov r,r 具有 1c 延遲的 CPU 上的兩個(gè).

Clang's mov imm / sub method puts only one cycle of latency for pos onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi / not ecx on CPUs where mov r,r has 1c latency.

使用 BMI2(Haswell 及更高版本),最佳 ASM 版本可以將 mov 保存到 ecx.其他一切都一樣,因?yàn)?shlx 將其移位??計(jì)數(shù)輸入寄存器屏蔽到操作數(shù)大小,就像 shl 一樣.

With BMI2 (Haswell and later), an optimal ASM version can save a mov to ecx. Everything else works the same, because shlx masks its shift-count input register down to the operand-size, just like shl.

x86 移位指令具有瘋狂的 CISC 語義,如果移位計(jì)數(shù)為零,則標(biāo)志不受影響.因此,可變計(jì)數(shù)移位指令對標(biāo)志的舊值具有(潛在的)依賴性.普通"x86 shl r, cl 在 Haswell 上解碼為 3 uops,但 BMI2 shlx r, r, r 僅為 1.所以 gcc 仍然發(fā)出 sal-march=haswell 一起使用,而不是使用 shlx(它確實(shí)在其他一些情況下使用).

x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl decodes to 3 uops on Haswell, but BMI2 shlx r, r, r is only 1. So it's too bad that gcc still emits sal with -march=haswell, instead of using shlx (which it does use in some other cases).

// hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
    not     esi           ; The low 6 bits hold 63-pos.  gcc's two-s complement trick
    xor     eax, eax      ; break false dependency on Intel.  maybe not needed when inlined.
    shlx    rdi, rdi, rsi ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi
    sar     rdi, 63       ; broadcast the sign bit: rdi=0 or -1
    and     eax, edi      ; eax = 0 or its previous value
    ret

英特爾 Haswell 的性能分析:6 個(gè)融合域 uops(前端:每 1.5c 一個(gè)).執(zhí)行單元:2 p0/p6 shift uops.1 p1 uop.2 個(gè)任意端口 uops:(從總執(zhí)行端口限制中每 1.25c 一個(gè)).關(guān)鍵路徑延遲:shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result.(或來自 pos->result 的 6c).

Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops: (one per 1.25c from total execution port limits). Critical path latency: shlx(1) -> popcnt(3) -> and(1) = 5c bitset->result. (or 6c from pos->result).

請注意,內(nèi)聯(lián)時(shí),人類(或智能編譯器)可以避免對 xor eax, eax 的需要.它只是因?yàn)?popcnt 對輸出寄存器的錯(cuò)誤依賴(在 Intel 上),我們需要 eax 中的輸出(調(diào)用者可能最近使用了很長時(shí)間)鏈).使用 -mtune=bdver2 或其他東西,gcc 不會(huì)將用于 popcnt 輸出的寄存器清零.

Note that when inlining, a human (or smart compiler) could avoid the need for the xor eax, eax. It's only there because of popcnt's false dependency on the output register (on Intel), and we need the output in eax (which the caller may have used recently for a long dep chain). With -mtune=bdver2 or something, gcc won't zero the register it's going to use for popcnt output.

內(nèi)聯(lián)時(shí),我們可以使用一個(gè)輸出寄存器,該寄存器至少早在 popcnt 的源代碼中就已經(jīng)準(zhǔn)備好了,以避免出現(xiàn)問題.當(dāng)以后不需要源時(shí),編譯器將執(zhí)行就地 popcnt rdi,rdi ,但這里不是這種情況.相反,我們可以選擇另一個(gè)在源之前已經(jīng)準(zhǔn)備好的寄存器.popcnt的輸入依賴于63-pos,我們可以破壞它,所以popcnt rsi,rdi對rsi的依賴不能延遲它.或者如果我們在寄存器中有 63,我們可以 popcnt rsi,rdi/sarx rax, rsi, reg_63/ and eax,esi.或者 BMI2 3 操作數(shù)移位指令也可以讓我們不破壞輸入,以防以后需要它們.

When inlining, we could use an output register that already has to be ready at least as early as popcnt's source reg to avoid the problem. Compilers will do an in-place popcnt rdi,rdi when the source isn't needed later, but that's not the case here. Instead, we can pick another register that already has to be ready before the source. popcnt's input depends on 63-pos, and we can clobber it, so popcnt rsi,rdi's dependency on rsi can't delay it. Or if we had 63 in a register, we could popcnt rsi,rdi / sarx rax, rsi, reg_63 / and eax, esi. Or BMI2 3-operand shift instructions would also let us not clobber inputs in case they're needed afterwards.

這太輕了,以至于循環(huán)開銷和設(shè)置輸入操作數(shù)/存儲(chǔ)結(jié)果將成為主要因素.(并且 63-pos 可以使用編譯時(shí)常量進(jìn)行優(yōu)化,或者優(yōu)化到變量計(jì)數(shù)的來源.)

This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. (And the 63-pos can optimize away with a compile-time constant, or into wherever a variable count comes from.)

英特爾編譯器很有趣,它沒有利用 A[63] 是符號(hào)位這一事實(shí).shl/bt rdi, 63/jc.它甚至以一種非常愚蠢的方式設(shè)置了分支.它可以將 eax 歸零,然后根據(jù) shl 設(shè)置的符號(hào)標(biāo)志跳過 popcnt 或不跳過.

The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit. shl / bt rdi, 63 / jc. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl.

最佳分支實(shí)現(xiàn),從 Godbolt 上 -O3 -march=corei7 的 ICC13 輸出開始:

An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7 on godbolt:

   // hand-tuned, not compiler output
        mov       ecx, esi    ; ICC uses neg/add/mov :/
        not       ecx
        xor       eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
        shl       rdi, cl
        jns    .bit_not_set
        popcnt    rax, rdi
.bit_not_set:
        ret

這幾乎是最優(yōu)的:A[pos] == true 情況有一個(gè)未被采用的分支.不過,與無分支方法相比,它并沒有節(jié)省多少.

That's pretty much optimal: The A[pos] == true case has one not-taken branch. It doesn't save very much over the branchless method, though.

如果 A[pos] == false 情況更常見:跳過 ret 指令,到 popcnt/>ret.(或在內(nèi)聯(lián)之后:跳轉(zhuǎn)到執(zhí)行 popcnt 并跳回的末尾塊).

If the A[pos] == false case is more common: jump over a ret instruction, to a popcnt / ret. (Or after inlining: jump to a block at the end that does the popcnt and jumps back).

這篇關(guān)于在一個(gè)位置或更低位置計(jì)算設(shè)置位的有效方法是什么?的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網(wǎng)!

【網(wǎng)站聲明】本站部分內(nèi)容來源于互聯(lián)網(wǎng),旨在幫助大家更快的解決問題,如果有圖片或者內(nèi)容侵犯了您的權(quán)益,請聯(lián)系我們刪除處理,感謝您的支持!

相關(guān)文檔推薦

How do I set the icon for my application in visual studio 2008?(如何在 Visual Studio 2008 中為我的應(yīng)用程序設(shè)置圖標(biāo)?)
Convert CString to const char*(將 CString 轉(zhuǎn)換為 const char*)
Remove secure warnings (_CRT_SECURE_NO_WARNINGS) from projects by default in Visual Studio(默認(rèn)情況下,在 Visual Studio 中從項(xiàng)目中刪除安全警告 (_CRT_SECURE_NO_WARNINGS))
How do I start a new CUDA project in Visual Studio 2008?(如何在 Visual Studio 2008 中啟動(dòng)新的 CUDA 項(xiàng)目?)
Exporting classes containing `std::` objects (vector, map etc.) from a DLL(從 DLL 導(dǎo)出包含 `std::` 對象(向量、映射等)的類)
What are some reasons a Release build would run differently than a Debug build(發(fā)布版本與調(diào)試版本的運(yùn)行方式不同的一些原因是什么)
主站蜘蛛池模板: 地图标注|微信高德百度地图标注|地图标记-做地图[ZuoMap.com] | 伊卡洛斯软装首页-电动窗帘,别墅窗帘,定制窗帘,江浙沪1000+别墅窗帘案例 | 登车桥动力单元-非标液压泵站-非标液压系统-深圳市三好科技有限公司 | 东风体检车厂家_公共卫生体检车_医院体检车_移动体检车-锦沅科贸 | 蓝鹏测控平台 - 智慧车间系统 - 车间生产数据采集与分析系统 | 真空上料机(一种真空输送机)-百科| 断桥铝破碎机_发动机破碎机_杂铝破碎机厂家价格-皓星机械 | 金库门,金库房,金库门厂家,金库门价格-河北特旺柜业有限公司 | 伸缩节_伸缩器_传力接头_伸缩接头_巩义市联通管道厂 | 创客匠人-让IP变现不走弯路| 保镖公司-私人保镖-深圳保镖公司【环宇兄弟保镖】 | 培训中心-翰香原香酥板栗饼加盟店总部-正宗板栗酥饼技术 | 单锥双螺旋混合机_双螺旋锥形混合机-无锡新洋设备科技有限公司 | 自动部分收集器,进口无油隔膜真空泵,SPME固相微萃取头-上海楚定分析仪器有限公司 | 包塑丝_高铁绑丝_地暖绑丝_涂塑丝_塑料皮铁丝_河北创筹金属丝网制品有限公司 | 上海冠顶工业设备有限公司-隧道炉,烘箱,UV固化机,涂装设备,高温炉,工业机器人生产厂家 | 北京企业宣传片拍摄_公司宣传片制作-广告短视频制作_北京宣传片拍摄公司 | 深圳标识制作公司-标识标牌厂家-深圳广告标识制作-玟璟广告-深圳市玟璟广告有限公司 | 胶水,胶粘剂,AB胶,环氧胶,UV胶水,高温胶,快干胶,密封胶,结构胶,电子胶,厌氧胶,高温胶水,电子胶水-东莞聚力-聚厉胶粘 | 烟台条码打印机_烟台条码扫描器_烟台碳带_烟台数据采集终端_烟台斑马打印机-金鹏电子-金鹏电子 | 塑料造粒机「厂家直销」-莱州鑫瑞迪机械有限公司 | HEYL硬度计量泵-荧光法在线溶解氧仪-净时测控技术(上海)有限公司 | 铁盒_铁罐_马口铁盒_马口铁罐_铁盒生产厂家-广州博新制罐 | 家用净水器代理批发加盟_净水机招商代理_全屋净水器定制品牌_【劳伦斯官网】 | 合金ICP光谱仪(磁性材料,工业废水)-百科| 碳化硅,氮化硅,冰晶石,绢云母,氟化铝,白刚玉,棕刚玉,石墨,铝粉,铁粉,金属硅粉,金属铝粉,氧化铝粉,硅微粉,蓝晶石,红柱石,莫来石,粉煤灰,三聚磷酸钠,六偏磷酸钠,硫酸镁-皓泉新材料 | 紫外荧光硫分析仪-硫含量分析仪-红外光度测定仪-泰州美旭仪器 | 石油/泥浆/不锈钢防腐/砂泵/抽砂泵/砂砾泵/吸砂泵/压滤机泵 - 专业石油环保专用泵厂家 | 展厅设计-展馆设计-专业企业展厅展馆设计公司-昆明华文创意 | 无菌实验室规划装修设计-一体化实验室承包-北京洁净净化工程建设施工-北京航天科恩实验室装备工程技术有限公司 | 北京中创汇安科贸有限公司 | 拖鞋定制厂家-品牌拖鞋代加工厂-振扬实业中国高端拖鞋大型制造商 | 铝合金电阻-无源谐波滤波器-上海稳达电讯设备厂| 舞台木地板厂家_体育运动木地板_室内篮球馆木地板_实木运动地板厂家_欧氏篮球地板推荐 | 保温杯,儿童婴童奶瓶,运动水壶「广告礼品杯定制厂家」超朗保温杯壶 | 台湾阳明固态继电器-奥托尼克斯光电传感器-接近开关-温控器-光纤传感器-编码器一级代理商江苏用之宜电气 | 创客匠人-让IP变现不走弯路| 婚博会2024时间表_婚博会门票领取_婚博会地址-婚博会官网 | 荣事达手推洗地机_洗地机厂家_驾驶式扫地机_工业清洁设备 | 广东泵阀展|阀门展-广东国际泵管阀展览会| SMN-1/SMN-A ABB抽屉开关柜触头夹紧力检测仪-SMN-B/SMN-C-上海徐吉 |