Problem Description
I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320".
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000, 0.80F);
and that lets it get a little farther, but is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time penalties? Would things work better if I wasn't using a String, or if I defined a different hashCode function (which, in the case of a particular instance of a String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000,0.80F); // in constructor
duplicates = 0;
...
// In loop: For(each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
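On the sizing question: below is a minimal sketch, assuming the roughly 22 million final count mentioned in the question, of presizing the set so the internal table is allocated once and never rehashed. Note this only removes rehashing cost; it does not reduce the per-entry memory overhead discussed in the answer further down.

import java.util.HashSet;
import java.util.Set;
...
int expectedEntries = 22_000_000; // rough final size taken from the question
float loadFactor = 0.80F;
// A HashSet resizes once size exceeds capacity * loadFactor, so asking for
// expectedEntries / loadFactor buckets up front (rounded up internally to a
// power of two) means the table never has to rehash.
int initialCapacity = (int) (expectedEntries / loadFactor) + 1;
Set<String> tweetids = new HashSet<String>(initialCapacity, loadFactor); // in constructor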
Solution
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations; first, HashSet<String> was simply enormous and uncalled for because the String.hashCode() is exorbitant for this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came with departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. Final implementation was simple, and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000,0.80F); // in constructor
...
// inside for(each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}
Recommended Answer
You may want to look beyond the Java collections framework. I've done some memory-intensive processing and you will face several problems:
- The number of buckets for large hash maps and hash sets causes a lot of overhead (memory). You can influence this by using some kind of custom hash function and a modulus of, e.g., 50000 (see the sketch after this list).
- Strings are represented with 16-bit characters in Java. You can halve that by using UTF-8 encoded byte arrays for most scripts.
- HashMap is in general a quite wasteful data structure, and HashSet is basically just a thin wrapper around it.
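As a sketch of the first bullet, here is one possible reading of the custom-hash-plus-modulus advice (not code from the original answer): keep a fixed array of 50,000 buckets of raw longs, so nothing ever rehashes and there is no per-entry object overhead. The bucket count and the linear bucket scan are illustrative choices.

import gnu.trove.list.array.TLongArrayList;
...
final int NUM_BUCKETS = 50000; // the modulus from the bullet above
TLongArrayList[] buckets = new TLongArrayList[NUM_BUCKETS];
for (int i = 0; i < NUM_BUCKETS; i++) buckets[i] = new TLongArrayList();
...
// inside for(each record), with the id already parsed to a long
TLongArrayList bucket = buckets[(int) (id % NUM_BUCKETS)]; // tweet ids are positive
if (bucket.contains(id)) { // linear scan; roughly 440 ids per bucket at 22 million total
    duplicates++;
    continue;
}
bucket.add(id);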
Given that, take a look at Trove or Guava for alternatives. Also, your ids look like longs. Those are 64-bit, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter would tell you if something is definitely not in a set, and with reasonable certainty (less than 100%) if something is contained. That, combined with some disk-based solution (e.g. database, mapdb, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check if you need to look in the database, and thus avoid expensive lookups most of the time.
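A rough sketch of that combination, assuming Guava's BloomFilter; existsInStore and bufferForBatchWrite are hypothetical stand-ins for whatever disk-based store (database, mapdb, memcached, ...) and batched writing you pair it with.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
...
// ~22 million expected insertions and a 1% false-positive rate (illustrative numbers)
BloomFilter<Long> maybeSeen = BloomFilter.create(Funnels.longFunnel(), 22_000_000, 0.01);
...
// inside for(each record)
long id = Long.parseLong(twid);
if (!maybeSeen.mightContain(id)) {
    // definitely never seen before: no need to hit the disk-based store at all
    maybeSeen.put(id);
    bufferForBatchWrite(id); // hypothetical helper that writes new ids in batches
} else if (existsInStore(id)) { // mightContain can be a false positive, so confirm
    duplicates++;
} else {
    bufferForBatchWrite(id); // false positive: the id was actually new
}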