
boost::flat_map and its performance compared to map and unordered_map

Problem description


It is common knowledge in programming that memory locality improves performance a lot due to cache hits. I recently found out about boost::flat_map which is a vector based implementation of a map. It doesn't seem to be nearly as popular as your typical map/unordered_map so I haven't been able to find any performance comparisons. How does it compare and what are the best use cases for it?

Thanks!

Solution

I have run a benchmark on different data structures very recently at my company so I feel I need to drop a word. It is very complicated to benchmark something correctly.

Benchmarking

On the web, we rarely find (if ever) a well-engineered benchmark. Until today I only found benchmarks that were done the journalist way (pretty quickly and sweeping dozens of variables under the carpet).

1) You need to consider cache warming

Most people running benchmarks are afraid of timer discrepancy, so they run their operations thousands of times and take the total time; they are just careful to take the same number of thousands for every operation, and then consider the results comparable.

The truth is, in the real world this makes little sense, because your cache will not be warm, and your operation will likely be called just once. Therefore you need to benchmark using RDTSC, and time operations by calling them only once. Intel has made a paper describing how to use RDTSC (using a cpuid instruction to flush the pipeline, and calling it at least 3 times at the beginning of the program to stabilize it).

2) RDTSC accuracy measure

I also recommend doing this:

#include <algorithm>  // std::min, std::max
#include <climits>    // LLONG_MAX
#include <cstdio>
#ifdef _MSC_VER
#include <intrin.h>   // __cpuid, __rdtsc
#endif

typedef unsigned long long u64;

u64 g_correctionFactor;  // number of clocks to offset after each measurement to remove the overhead of the measurer itself.
u64 g_accuracy;

static u64 const errormeasure = ~((u64)0);

#ifdef _MSC_VER
#pragma intrinsic(__rdtsc)
inline u64 GetRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // flush OOO instruction pipeline
    return __rdtsc();
}

inline void WarmupRDTSC()
{
    int a[4];
    __cpuid(a, 0x80000000);  // warmup cpuid.
    __cpuid(a, 0x80000000);
    __cpuid(a, 0x80000000);
    
    // measure the measurer overhead with the measurer (crazy he..)
    u64 minDiff = LLONG_MAX;
    u64 maxDiff = 0;   // this is going to help calculate our PRECISION ERROR MARGIN
    for (int i = 0; i < 80; ++i)
    {
        u64 tick1 = GetRDTSC();
        u64 tick2 = GetRDTSC();
        minDiff = std::min(minDiff, tick2 - tick1);   // make many takes, take the smallest that ever come.
        maxDiff = std::max(maxDiff, tick2 - tick1);
    }
    g_correctionFactor = minDiff;

    printf("Correction factor %llu clocks\n", g_correctionFactor);

    g_accuracy = maxDiff - minDiff;
    printf("Measurement Accuracy (in clocks) : %llu\n", g_accuracy);
}
#endif

This is a discrepancy measurer; it takes the minimum of all measured values, to avoid getting a -10**18 (the first negative values of a 64-bit integer) from time to time.

Notice the use of intrinsics rather than inline assembly. First, inline assembly is rarely supported by compilers nowadays. Much worse, the compiler creates a full ordering barrier around inline assembly because it cannot statically analyze the inside, so this is a problem when benchmarking real-world code, especially when calling things just once. An intrinsic is suited here because it doesn't break the compiler's free reordering of instructions.
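The same calibrate-then-subtract idea can be sketched portably. This is my illustration, not part of the original answer: it uses std::chrono::steady_clock instead of RDTSC, so it measures wall-clock nanoseconds rather than cycles, but it applies the same minimum-of-many-takes trick to estimate the timer's own overhead and subtracts it from a single-shot measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

using Clock = std::chrono::steady_clock;

// Calibrate: time the timer against itself many times and keep the
// minimum, exactly like the minDiff trick in WarmupRDTSC above.
inline std::uint64_t timer_overhead_ns()
{
    std::uint64_t min_diff = UINT64_MAX;
    for (int i = 0; i < 80; ++i)
    {
        auto t1 = Clock::now();
        auto t2 = Clock::now();
        auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        min_diff = std::min<std::uint64_t>(min_diff, static_cast<std::uint64_t>(d));
    }
    return min_diff;
}

// Time one single call of f, subtracting the calibrated overhead.
template <class F>
std::uint64_t time_once_ns(F&& f, std::uint64_t overhead)
{
    auto t1 = Clock::now();
    f();
    auto t2 = Clock::now();
    auto d = static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    return d > overhead ? d - overhead : 0;  // clamp so we never underflow
}
```

The clamp at the end serves the same purpose as taking the minimum above: it prevents the nonsensical huge values you get when the overhead estimate exceeds a tiny measurement.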

3) Parameters

The last problem is people usually test for too few variations of the scenario. A container performance is affected by:

  1. Allocator
  2. Size of the contained type
  3. Cost of the copy, assignment, move, and construction operations of the contained type
  4. Number of elements in the container (size of the problem)
  5. Whether the type has trivial versions of the operations in 3.
  6. Whether the type is a POD

Point 1 is important because containers do allocate from time to time, and it matters a lot if they allocate using the CRT "new" or some user-defined operation, like pool allocation or freelist or other...

(for people interested about pt 1, join the mystery thread on gamedev about system allocator performance impact)
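As an illustration of point 1 (my sketch, not the original benchmark's code), C++17's polymorphic allocators let you swap the allocation strategy without touching the container type: the same red-black tree can take its nodes from a preallocated arena instead of issuing a CRT "new" per node.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory_resource>

// Same red-black tree, different allocation strategy: nodes come from a
// preallocated stack buffer instead of per-node heap allocations.
int sum_with_arena()
{
    std::byte buffer[16 * 1024];
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    std::pmr::map<int, int> m(&arena);   // only the allocator changed
    for (int i = 0; i < 100; ++i)
        m[i] = i * 2;

    int sum = 0;
    for (const auto& [key, value] : m)
        sum += value;
    return sum;                          // the arena releases everything at once
}
```

In a benchmark, this is exactly the kind of variable that must be controlled: the same container measured with the default allocator and with a pool can land in very different places in the ranking.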

Point 2 is because some containers (say A) will lose time copying stuff around, and the bigger the type the bigger the overhead. The problem is that when comparing to another container B, A may win over B for small types, and lose for larger types.

Point 3 is the same as point 2, except it multiplies the cost by some weighting factor.

Point 4 is a question of big O mixed with cache issues. Some bad-complexity containers can largely outperform low-complexity containers for a small number of elements (like vector vs. map: the vector's cache locality is good, whereas map fragments the memory). Then at some crossing point they lose, because the contained overall size starts to "leak" into main memory and cause cache misses, and on top of that the asymptotic complexity starts to be felt.

Point 5 is about compilers being able to elide operations that are empty or trivial at compile time. This can greatly optimize some operations, because the containers are templated, so each type will have its own performance profile.

Point 6 is the same as point 5: PODs can benefit from the fact that copy construction is just a memcpy, and some containers have a specific implementation for these cases, using partial template specialization, or SFINAE to select algorithms according to traits of T.
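A minimal sketch of that dispatch (my simplification, not any particular container's code), using C++17's if constexpr where a pre-C++17 container would use SFINAE or tag dispatch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Compile-time dispatch: one bulk memcpy for trivially copyable types,
// element-wise assignment otherwise.
template <class T>
void copy_elements(const T* src, T* dst, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>)
        std::memcpy(dst, src, n * sizeof(T));   // POD-like types: raw byte copy
    else
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];                    // others: operator= per element
}
```

Both branches are compiled only for the types that need them, which is why a container benchmarked with a POD can behave completely differently from the same container holding a non-trivial type.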

About the flat map

Apparently, the flat map is a sorted vector wrapper, like Loki AssocVector, but with some supplementary modernizations coming with C++11, exploiting move semantics to accelerate insert and delete of single elements.
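The core mechanics can be sketched with a sorted std::vector (a toy illustration of the idea, not boost's actual implementation): binary search locates the slot, and inserting a new element shifts the whole tail.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// The essence of a flat map: pairs kept sorted by key in one contiguous
// vector, so lookups are binary searches over cache-friendly memory.
struct FlatMap
{
    std::vector<std::pair<int, int>> data;  // invariant: sorted by first

    void insert(int key, int value)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<int, int>& p, int k) { return p.first < k; });
        if (it != data.end() && it->first == key)
            it->second = value;              // key exists: overwrite
        else
            data.insert(it, {key, value});   // shifts the tail: the O(N) part
    }

    int* find(int key)
    {
        auto it = std::lower_bound(data.begin(), data.end(), key,
            [](const std::pair<int, int>& p, int k) { return p.first < k; });
        return (it != data.end() && it->first == key) ? &it->second : nullptr;
    }
};
```

The vector::insert in the middle is what makes single-element insertion O(N), and it is also where move semantics help: shifting the tail moves elements instead of copying them.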

This is still an ordered container. Most people usually don't need the ordering part, therefore the existence of unordered...

Have you considered that maybe you need a flat_unorderedmap? Which would be something like google::sparse_map, that is, an open address hash map.

The problem of open address hash maps is that at the time of rehash they have to copy everything around to the new extended flat land, whereas a standard unordered map just has to recreate the hash index, while the allocated data stays where it is. The disadvantage of course is that the memory is fragmented like hell.
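A toy linear-probing table (illustration only, nothing like google::sparse_map's real code; keys are nonnegative integers and -1 marks an empty slot) makes that rehash cost visible: grow() must move every element into the new, larger flat array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal open address (linear probing) hash table for nonnegative keys.
struct OpenTable
{
    std::vector<long long> slots;   // the "flat land": keys stored inline
    std::size_t count = 0;

    explicit OpenTable(std::size_t cap = 8) : slots(cap, -1) {}

    void insert(long long key)
    {
        if (count + 1 > slots.size() * 8 / 10)  // load factor 0.8 exceeded
            grow();
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        while (slots[i] != -1 && slots[i] != key)
            i = (i + 1) % slots.size();         // linear probe
        if (slots[i] == -1) { slots[i] = key; ++count; }
    }

    bool contains(long long key) const
    {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        for (std::size_t probes = 0; probes < slots.size(); ++probes)
        {
            if (slots[i] == -1)  return false;  // hit an empty slot: absent
            if (slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }

    void grow()
    {
        // The expensive part: every element is rehashed and copied into
        // the new flat array. A node-based unordered_map would only
        // rebuild its bucket index and leave the nodes in place.
        std::vector<long long> old = std::move(slots);
        slots.assign(old.size() * 2, -1);
        count = 0;
        for (long long k : old)
        {
            if (k == -1) continue;
            std::size_t i = static_cast<std::size_t>(k) % slots.size();
            while (slots[i] != -1)
                i = (i + 1) % slots.size();
            slots[i] = k;
            ++count;
        }
    }
};
```

The flip side of that copy is exactly the locality advantage discussed above: between rehashes, probing walks contiguous memory instead of chasing pointers.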

The criterion of a rehash in an open address hash map is when the capacity exceeds the size of the bucket vector multiplied by the load factor.

A typical load factor is 0.8; therefore you need to care about it. If you can pre-size your hash map before filling it, always pre-size to intended_filling * (1/0.8) + epsilon; this gives you a guarantee of never having to spuriously rehash and recopy everything during filling.
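With std::unordered_map specifically, reserve() already divides by max_load_factor() for you (it is specified as rehash(ceil(n / max_load_factor()))), so the pre-sizing rule above reduces to one call. A small sketch, assuming the standard container:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Pre-size so that filling up to intended_filling elements never rehashes.
// reserve(n) is specified as rehash(ceil(n / max_load_factor())), which is
// exactly the intended_filling * (1/0.8) rule when max_load_factor is 0.8.
std::unordered_map<int, int> make_presized(std::size_t intended_filling)
{
    std::unordered_map<int, int> m;
    m.max_load_factor(0.8f);
    m.reserve(intended_filling);
    return m;
}
```

The manual 1/0.8 multiplication is still what you need with APIs that take a bucket count directly (like rehash()).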

The advantage of closed address maps (std::unordered..) is that you don't have to care about those parameters.

But the boost::flat_map is an ordered vector; therefore, it will always have a log(N) asymptotic complexity, which is less good than the open address hash map (amortized constant time). You should consider that as well.

Benchmark results

This is a test involving different maps (with int key and __int64/somestruct as value) and std::vector.

tested types information:

typeid=__int64 .  sizeof=8 . ispod=yes
typeid=struct MediumTypePod .  sizeof=184 . ispod=yes

Insertion

EDIT:

My previous results included a bug: they actually tested ordered insertion, which exhibited a very fast behavior for the flat maps.
I left those results later down this page because they are interesting.
This is the correct test:

I have checked the implementation, there is no such thing as a deferred sort implemented in the flat maps here. Each insertion sorts on the fly, therefore this benchmark exhibits the asymptotic tendencies:

map: O(N * log(N))
hashmaps: O(N)
vector and flatmaps: O(N * N)

Warning: hereafter, the 2 tests for std::map and both flat_maps are buggy and actually test ordered insertion (vs. random insertion for the other containers; yes, it's confusing, sorry):

We can see that ordered insertion results in back pushing and is extremely fast. However, from the non-charted results of my benchmark, I can also say that this is not near absolute optimality for a back-insertion. At 10k elements, perfect back-insertion optimality is obtained on a pre-reserved vector, which gives us 3 million cycles; we observe 4.8M here for the ordered insertion into the flat_map (therefore 160% of the optimal).

Analysis: remember this is 'random insert' for the vector, so the massive 1 billion cycles come from having to shift half the data (on average) upward, one element at a time, at each insertion.

Random search of 3 elements (clocks renormalized to 1)

in size = 100

in size = 10000

Iteration

over size 100 (only MediumPod type)

over size 10000 (only MediumPod type)

Final grain of salt

In the end, I wanted to come back to "Benchmarking §3 Pt1" (the system allocator). In a recent experiment on the performance of an open address hash map I developed, I measured a performance gap of more than 3000% between Windows 7 and Windows 8 on some std::unordered_map use cases (discussed here).
This makes me want to warn the reader about the above results (they were made on Win7): your mileage may vary.
