SVML's __m256d _mm256_log2_pd (__m256d a)
is not available on compilers other than Intel's, and they say its performance is handicapped on AMD processors. There are some implementations on the internet, referenced in AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? and SIMD math libraries for SSE and AVX, however they seem to be more SSE than AVX2. There's also Agner Fog's vector library, however it's a large library with much more stuff than just vector log2, so from its implementation it's hard to figure out the essential parts for just the vector log2 operation.
So can someone just explain how to implement the log2()
operation for a vector of 4 double
numbers efficiently? I.e. like what __m256d _mm256_log2_pd (__m256d a)
does, but available for other compilers and reasonably efficient for both AMD and Intel processors.
EDIT: In my current specific case, the numbers are probabilities between 0 and 1, and the logarithm is used for entropy computation: the negation of the sum over all i of P[i]*log(P[i]). The range of floating-point exponents for P[i] is large, so the numbers can be close to 0. I'm not sure about the accuracy needed, so I would consider any solution that starts with 30 bits of mantissa; a tuneable solution is especially preferred.
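For reference, here is a minimal sketch of how a vectorized log2 slots into that entropy computation (assuming a Log2() like the one defined below; the function name EntropyBits is just for illustration):
#include <immintrin.h>
#include <cstddef>
__m256d __vectorcall Log2(__m256d x); // implemented below
// Entropy in bits: -sum(p[i] * log2(p[i])). Assumes n % 4 == 0 and p[i] > 0;
// exact zeros would need separate handling (0*log(0) is taken as 0 for entropy).
double EntropyBits(const double* p, size_t n) {
  __m256d sums = _mm256_setzero_pd();
  for (size_t i = 0; i < n; i += 4) {
    const __m256d probs = _mm256_loadu_pd(p + i);
    sums = _mm256_fnmadd_pd(probs, Log2(probs), sums); // sums -= probs*log2(probs)
  }
  double buf[4];
  _mm256_storeu_pd(buf, sums);
  return buf[0] + buf[1] + buf[2] + buf[3];
}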
EDIT2: here is my implementation so far, based on the "More efficient series" from https://en.wikipedia.org/wiki/Logarithm#Power_series. How can it be improved? (both performance and accuracy improvements are desired)
namespace {
const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
const __m128i gExpNormalizer = _mm_set1_epi32(1023);
//TODO: some 128-bit variable or two 64-bit variables here?
const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
const __m256d gVect1 = _mm256_set1_pd(1.0);
}
__m256d __vectorcall Log2(__m256d x) {
// Extract the biased 11-bit exponents into the low bits of each 64-bit lane
const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
// Pack the 4 exponents into the low 128 bits and subtract the bias (1023)
const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
// y = mantissa of x with the exponent forced to 0, i.e. 1 <= y < 2
const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
  _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));
// Calculate t=(y-1)/(y+1) and t**2
const __m256d tNum = _mm256_sub_pd(y, gVect1);
const __m256d tDen = _mm256_add_pd(y, gVect1);
const __m256d t = _mm256_div_pd(tNum, tDen);
const __m256d t2 = _mm256_mul_pd(t, t); // t**2
const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);
return log2_x;
}
So far my implementation gives 405 268 490 operations per second, and it seems precise to the 8th digit. The performance is measured with the following function:
#include <chrono>
#include <cmath>
#include <cstdio>
#include <immintrin.h>
// ... Log2() implementation here
const int64_t cnLogs = 100 * 1000 * 1000;
void BenchmarkLog2Vect() {
__m256d sums = _mm256_setzero_pd();
auto start = std::chrono::high_resolution_clock::now();
for (int64_t i = 1; i <= cnLogs; i += 4) {
const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
const __m256d logs = Log2(x);
sums = _mm256_add_pd(sums, logs);
}
auto elapsed = std::chrono::high_resolution_clock::now() - start;
double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
printf("Vect Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}
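Note that the sums.m256d_f64[] element access is MSVC-specific; a portable way to do the same horizontal sum (a small sketch, same result) is to store the vector and add the lanes:
double buf[4];
_mm256_storeu_pd(buf, sums);
double sum = buf[0] + buf[1] + buf[2] + buf[3];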
Compared to the results of Logarithm in C++ and assembly, the current vector implementation is 4 times faster than std::log2() and 2.5 times faster than std::log().
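If you want to reproduce such a comparison locally, a matching scalar baseline could look like this (a sketch reusing the includes and cnLogs from the benchmark above; not the exact code behind the quoted numbers):
void BenchmarkLog2Scalar() {
  double sum = 0;
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i++) {
    sum += std::log2(double(i));
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  printf("Scalar log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}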
Specifically, the following approximation formula is used:
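In the notation of the code above (this is the truncated artanh / "more efficient" series from the Wikipedia link), with x = y*2^e where 1 <= y < 2, e the unbiased exponent, and t = (y-1)/(y+1):

ln(y) = 2*artanh(t) = 2*(t + t^3/3 + t^5/5 + t^7/7 + ...)

log2(x) = e + (2/ln 2)*(t + t^3/3 + t^5/5 + t^7/7 + t^9/9)

which is what Log2() above computes, truncating the series after the t^9 term.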
Finally here is my best result, which on a Ryzen 1800X @3.6GHz gives about 0.8 billion logarithms per second (200 million vectors of 4 logarithms each) in a single thread, and is accurate up to the last few bits of the mantissa. Spoiler: see at the end how to increase the performance to 0.87 billion logarithms per second.
Special cases:
Negative numbers, negative infinity, and NaNs with a negative sign bit are treated as if they were very close to 0 (resulting in some garbage large negative "logarithm" values). Positive infinity and NaNs with a positive sign bit result in a logarithm around 1024. If you don't like how the special cases are treated, one option is to add code that checks for them and does whatever suits you better; this will make the computation slower.
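For example, a minimal sketch of such a check (not part of the measured code; the helper name SanitizeForLog2 is made up for illustration) could force non-positive and NaN inputs to quiet NaN before calling the logarithm, at the cost of one compare and one blend per vector:
#include <immintrin.h>
#include <limits>
// Hypothetical pre-filter: lanes with x > 0 pass through unchanged,
// everything else (x <= 0, -inf, NaN) becomes quiet NaN.
inline __m256d SanitizeForLog2(__m256d x) {
  const __m256d nan = _mm256_set1_pd(std::numeric_limits<double>::quiet_NaN());
  const __m256d isPositive = _mm256_cmp_pd(x, _mm256_setzero_pd(), _CMP_GT_OQ);
  return _mm256_blendv_pd(nan, x, isPositive);
}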
namespace {
// The limit is 19 because we process only high 32 bits of doubles, and out of
// 20 bits of mantissa there, 1 bit is used for rounding.
constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
constexpr uint16_t cZeroExp = 1023;
const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
~((1ULL << (52-cnLog2TblBits)) - 1) );
const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
1ULL << (52 - cnLog2TblBits - 1)));
const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
const __m128i gExpNorm0 = _mm_set1_epi32(1023);
// plus |cnLog2TblBits|th highest mantissa bit
double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace
void InitLog2Table() {
for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {
const uint64_t iZp = (uint64_t(cZeroExp) << 52)
| (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
const double zp = *reinterpret_cast<const double*>(&iZp);
const double l2zp = std::log2(zp);
gPlusLog2Table[i] = l2zp;
}
}
__m256d __vectorcall Log2TblPlus(__m256d x) {
const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
_mm256_castpd_si256(x), gHigh32Permute));
// This requires that x is non-negative, because the sign bit is not cleared before
// computing the exponent.
const __m128i exps32 = _mm_srai_epi32(high32, 20);
const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);
// Compute y as approximately equal to log2(z)
const __m128i indexes = _mm_and_si128(cSseMantTblMask,
_mm_srai_epi32(high32, 20 - cnLog2TblBits));
const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
/*number of bytes per item*/ 8);
// Compute A as z/exp2(y)
const __m256d exp2_Y = _mm256_or_pd(
cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));
// Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
const __m256d tDen = _mm256_add_pd(z, exp2_Y);
// Compute the first polynomial term from "More efficient series" of https://en.wikipedia.org/wiki/Logarithm#Power_series
const __m256d t = _mm256_div_pd(tNum, tDen);
const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);
// Leading integer part for the logarithm
const __m256d leading = _mm256_cvtepi32_pd(normExps);
const __m256d log2_x = _mm256_add_pd(log2_z, leading);
return log2_x;
}
It uses a combination of a lookup table approach and a 1st-degree polynomial, mostly described on Wikipedia (the link is in the code comments). I can afford to allocate 8KB of L1 cache here (half of the 16KB of L1 cache available per logical core), because logarithm computation is really the bottleneck for me and there is not much else that needs the L1 cache.
However, if you need more L1 cache for other needs, you can decrease the amount of cache used by the logarithm algorithm by reducing cnLog2TblBits
to e.g. 5, at the expense of decreasing the accuracy of the logarithm computation.
Or to keep the accuracy high, you can increase the number of polynomial terms by adding:
namespace {
// ...
const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}
And then changing the tail of Log2TblPlus()
after the line const __m256d t = _mm256_div_pd(tNum, tDen);
:
const __m256d t2 = _mm256_mul_pd(t, t); // t**2
const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);
const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);
Then the comment line // Leading integer part for the logarithm
and the rest follow unchanged.
Normally you don't need that many terms, even for a few-bit table; I just provided the coefficients and computations for reference. It's likely that if cnLog2TblBits==5, you won't need anything beyond terms012. But I haven't done such measurements; you need to experiment to find what suits your needs.
Obviously, the fewer polynomial terms you compute, the faster the computation is.
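One thing to keep in mind when using this version: the lookup table must be filled before the first call. A minimal usage sketch (reusing only the functions and includes defined above) is:
int main() {
  InitLog2Table(); // fills the 8KB table; without this it stays all-zero and the results are wrong
  const __m256d x = _mm256_set_pd(8.0, 4.0, 2.0, 1.0);
  const __m256d y = Log2TblPlus(x);
  double res[4];
  _mm256_storeu_pd(res, y);
  printf("%g %g %g %g\n", res[0], res[1], res[2], res[3]); // approximately 0 1 2 3
  return 0;
}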
EDIT: the question In what situation would the AVX2 gather instructions be faster than individually loading the data? suggests that you may get a performance improvement if
const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
/*number of bytes per item*/ 8);
is replaced by
const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
gPlusLog2Table[indexes.m128i_u32[2]],
gPlusLog2Table[indexes.m128i_u32[1]],
gPlusLog2Table[indexes.m128i_u32[0]]);
For my implementation it saves about 1.5 cycles, reducing the total cycle count to compute 4 logarithms from 18 to 16.5, so the performance rises to 0.87 billion logarithms per second. I'm leaving the current implementation as is because it's more idiomatic and should be faster once CPUs start doing gather
operations right (with coalescing like GPUs do).
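Also note that the indexes.m128i_u32[] element access is MSVC-only; on GCC/Clang a portable way to do the same four table loads (a sketch, assuming SSE4.1 for _mm_extract_epi32) would be:
const __m256d y = _mm256_set_pd(
  gPlusLog2Table[_mm_extract_epi32(indexes, 3)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 2)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 1)],
  gPlusLog2Table[_mm_extract_epi32(indexes, 0)]);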
EDIT2: on a Ryzen CPU (but not on Intel) you can get a little more speedup (about 0.5 cycles) by replacing
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
_mm256_castpd_si256(x), gHigh32Permute));
with
const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
_MM_SHUFFLE(3, 1, 3, 1)));