問題描述
我正在編寫一個霍夫曼編碼/解碼工具,并且正在尋找一種有效的方法來存儲為存儲在輸出文件中而創建的霍夫曼樹.
I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the Huffman tree that is created to store inside of the output file.
目前我正在實施兩個不同的版本.
Currently there are two different versions I am implementing.
- 這個將整個文件逐個字符讀入內存,并為整個文件建立一個頻率表.這只需要輸出一次樹,因此效率不是那么大,除非輸入文件很小.
- 我使用的另一種方法是讀取一塊大小約為 64 KB 的數據,然后對其運行頻率分析,創建一棵樹并對其進行編碼.但是,在這種情況下,在每個塊之前,我需要輸出我的頻率樹,以便解碼器能夠重新構建其樹并正確解碼編碼文件.這是效率確實發揮作用的地方,因為我想節省盡可能多的空間.
到目前為止,在我的搜索中,我還沒有找到在盡可能小的空間內存儲樹的好方法,我希望 StackOverflow 社區可以幫助我找到一個好的解決方案!
In my searches so far I have not found a good way of storing the tree in as little space as possible, I am hoping the StackOverflow community can help me find a good solution!
推薦答案
由于您已經需要實現代碼來處理字節組織流/文件之上的逐位層,因此這是我的建議.
Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
不要存儲實際頻率,解碼不需要它們.但是,您確實需要實際的樹.
Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.
所以對于每個節點,從根開始:
So for each node, starting at root:
- 如果葉子節點:輸出 1 位 + N 位字符/字節
- 如果不是葉子節點,則輸出 0 位.然后以相同的方式編碼兩個子節點(先左后右)
要閱讀,請執行以下操作:
To read, do this:
- 讀取位.如果為 1,則讀取 N 位字符/字節,返回其周圍沒有子節點的新節點
- 如果 bit 為 0,以同樣的方式解碼左右子節點,并返回帶有這些子節點的新節點,但沒有值
葉節點基本上是任何沒有子節點的節點.
A leaf-node is basically any node that doesn't have children.
使用這種方法,您可以在寫入之前計算輸出的確切大小,以確定收益是否足以證明付出的努力.這假設您有一個包含每個字符出現頻率的鍵/值對字典,其中頻率是實際出現的次數.
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
計算偽代碼:
Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
樹大小計算考慮了葉子節點和非葉子節點,內聯節點比字符少一個.
The tree-size calculation takes the leaf and non-leaf nodes into account, and there's one less inline node than there are characters.
SIZE_OF_ONE_CHARACTER 將是位數,這兩個將為您提供我對樹的方法 + 編碼數據將占用的總位數.
SIZE_OF_ONE_CHARACTER would be number of bits, and those two would give you the number of bits total that my approach for the tree + the encoded data will occupy.
PATH(c) 是一個函數/表,它會產生從根到樹中那個字符的位路徑.
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
這是一個看起來像 C# 的偽代碼,它假設一個字符只是一個簡單的字節.
Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.
void EncodeNode(Node node, BitWriter writer)
{
if (node.IsLeafNode)
{
writer.WriteBit(1);
writer.WriteByte(node.Value);
}
else
{
writer.WriteBit(0);
EncodeNode(node.LeftChild, writer);
EncodeNode(node.Right, writer);
}
}
讀回:
Node ReadNode(BitReader reader)
{
if (reader.ReadBit() == 1)
{
return new Node(reader.ReadByte(), null, null);
}
else
{
Node leftChild = ReadNode(reader);
Node rightChild = ReadNode(reader);
return new Node(0, leftChild, rightChild);
}
}
示例(簡化、使用屬性等)節點實現:
An example (simplified, use properties, etc.) Node implementation:
public class Node
{
public Byte Value;
public Node LeftChild;
public Node RightChild;
public Node(Byte value, Node leftChild, Node rightChild)
{
Value = value;
LeftChild = leftChild;
RightChild = rightChild;
}
public Boolean IsLeafNode
{
get
{
return LeftChild == null;
}
}
}
<小時>
這是一個特定示例的示例輸出.
Here's a sample output from a specific example.
輸入:AAAAAABCCCCCCDDEEEEE
Input: AAAAAABCCCCCCDDEEEEE
頻率:
- 答:6
- 乙:1
- C:6
- D:2
- E:5
每個字符只有 8 位,因此樹的大小將為 10 * 5 - 1 = 49 位.
Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.
樹看起來像這樣:
20
----------
| 8
| -------
12 | 3
----- | -----
A C E B D
6 6 5 1 2
所以每個字符的路徑如下(0左,1右):
So the paths to each character is as follows (0 is left, 1 is right):
- 答:00
- 乙:110
- C:01
- D:111
- E:10
所以要計算輸出大小:
- A:6 次出現 * 2 位 = 12 位
- B:1 次出現 * 3 位 = 3 位
- C:出現 6 次 * 2 位 = 12 位
- D:出現 2 次 * 3 位 = 6 位
- E:出現 5 次 * 2 位 = 10 位
編碼字節的總和為 12+3+12+6+10 = 43 位
Sum of encoded bytes is 12+3+12+6+10 = 43 bits
將其添加到樹中的 49 位,輸出將為 92 位或 12 字節.與存儲未編碼的原始 20 個字符所需的 20 * 8 個字節相比,您將節省 8 個字節.
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bytes necessary to store the original 20 characters unencoded, you'll save 8 bytes.
最終輸出,包括開始的樹,如下所示.流 (A-E) 中的每個字符都被編碼為 8 位,而 0 和 1 只是一個位.流中的空間只是將樹與編碼數據分開,不占用最終輸出中的任何空間.
The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 is just a single bit. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.
001A1C01E01B1D 0000000000001100101010101011111111010101010
<小時>
對于你在評論中的具體例子,AABCDEF,你會得到這個:
For the concrete example you have in the comments, AABCDEF, you will get this:
輸入:AABCDEF
頻率:
- 答:2
- 乙:1
- C:1
- D:1
- E:1
- F:1
樹:
7
-------------
| 4
| ---------
3 2 2
----- ----- -----
A B C D E F
2 1 1 1 1 1
路徑:
- 答:00
- 乙:01
- C:100
- D:101
- E:110
- F:111
樹:001A1B001C1D01E1F = 59 位
數據:000001100101110111 = 18位
總和:59 + 18 = 77 位 = 10 字節
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
由于原來是 7 個字符的 8 位 = 56,你會因為這么小的數據塊而有太多的開銷.
Since the original was 7 characters of 8 bits = 56, you will have too much overhead of such small pieces of data.
這篇關于存儲哈夫曼樹的有效方式的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!