Of course! Here is a complete guide to implementing Huffman coding in Python, including detailed explanations of the concepts, the code, and a full example.

What is Huffman Coding?
Huffman coding is a popular lossless data compression algorithm. Its core idea is to assign variable-length codes to input characters. The key principle is:
More frequent characters get shorter codes, and less frequent characters get longer codes.
This is based on the idea that the total number of bits used to represent a message will be minimized if the most common symbols use the fewest bits.
To achieve this, Huffman coding uses a binary tree, often called a Huffman Tree.

How It Works: The Algorithm
The process has two main phases: Encoding and Decoding.
Phase 1: Building the Huffman Tree (The Encoding Process)
-
Count Frequencies: First, count the frequency of each character in the input data.
-
Create Leaf Nodes: Create a node (or a leaf) for each character, associating it with its frequency.
-
Build the Tree: a. Create a priority queue (a min-heap is perfect for this) and add all the leaf nodes to it. The priority will be the node's frequency. b. While there is more than one node in the queue: i. Remove the two nodes with the lowest frequency from the queue. ii. Create a new internal node with these two nodes as its children. The frequency of this new node is the sum of its children's frequencies. iii. Add the new internal node back into the priority queue. c. The last remaining node in the queue is the root of the Huffman Tree.
-
Generate Codes: Traverse the Huffman Tree from the root to every leaf node.
- When you go left, append a
0to the code. - When you go right, append a
1to the code. - The code for a character is the bit string you've built when you reach its leaf node.
- When you go left, append a
Phase 2: Decoding
Decoding is straightforward if you have the Huffman Tree and the encoded bit string.
- Start at the root of the Huffman Tree.
- Iterate through the encoded bit string, one bit at a time.
- If the bit is
0, move to the left child. - If the bit is
1, move to the right child.
- If the bit is
- When you reach a leaf node, you've found a character. Append this character to your output.
- Return to the root and repeat the process with the next bit in the string until you've processed the entire bit string.
Python Implementation
We'll use Python's built-in heapq module, which provides an efficient implementation of a min-heap.
Data Structures
We need a way to store nodes in our priority queue. We'll create a simple Node class to represent characters and their frequencies.
import heapq
class Node:
"""A node in the Huffman Tree."""
def __init__(self, char, freq):
self.char = char
self.freq = freq
self.left = None
self.right = None
# For comparison purposes in the heap
def __lt__(self, other):
return self.freq < other.freq
The __lt__ (less than) method is essential. heapq needs to know how to compare two Node objects to maintain the heap order. We'll compare them based on their frequency.
The Huffman Coding Class
Let's encapsulate all the logic in a clean, reusable class.
import heapq
class HuffmanCoding:
def __init__(self):
# A dictionary to store the Huffman codes for each character
self.heap = []
self.codes = {}
self.reverse_mapping = {}
# Phase 1: Building the Huffman Tree
def _make_frequency_dict(self, text):
"""Counts the frequency of each character in the text."""
frequency = {}
for char in text:
if char not in frequency:
frequency[char] = 0
frequency[char] += 1
return frequency
def _make_heap(self, frequency):
"""Creates a min-heap from the frequency dictionary."""
for key in frequency:
node = Node(key, frequency[key])
heapq.heappush(self.heap, node)
def _merge_nodes(self):
"""Builds the Huffman Tree by merging nodes from the heap."""
while len(self.heap) > 1:
node1 = heapq.heappop(self.heap)
node2 = heapq.heappop(self.heap)
merged = Node(None, node1.freq + node2.freq)
merged.left = node1
merged.right = node2
heapq.heappush(self.heap, merged)
def _make_codes_helper(self, root, current_code):
"""A recursive helper to generate codes from the Huffman Tree."""
if root is None:
return
# If it's a leaf node, we've found a character
if root.char is not None:
self.codes[root.char] = current_code
self.reverse_mapping[current_code] = root.char
return
# Traverse left and right, appending 0 or 1 to the code
self._make_codes_helper(root.left, current_code + "0")
self._make_codes_helper(root.right, current_code + "1")
def _make_codes(self):
"""Generates the Huffman codes from the root of the tree."""
root = heapq.heappop(self.heap)
current_code = ""
self._make_codes_helper(root, current_code)
# Phase 2: Encoding and Decoding
def get_encoded_text(self, text):
"""Encodes the input text using the generated Huffman codes."""
encoded_text = ""
for char in text:
encoded_text += self.codes[char]
return encoded_text
def get_padded_encoded_text(self, encoded_text):
"""Pads the encoded text to make its length a multiple of 8."""
padding_amount = 8 - (len(encoded_text) % 8)
if padding_amount == 8:
padding_amount = 0
# The last byte stores the amount of padding
padded_info = "{0:08b}".format(padding_amount)
padded_encoded_text = padded_info + encoded_text + ("0" * padding_amount)
return padded_encoded_text
def get_byte_array(self, padded_encoded_text):
"""Converts the padded binary string into a byte array."""
b = bytearray()
for i in range(0, len(padded_encoded_text), 8):
byte = padded_encoded_text[i:i+8]
b.append(int(byte, 2))
return b
def compress(self, text):
"""Main function to compress the text."""
frequency = self._make_frequency_dict(text)
self._make_heap(frequency)
self._merge_nodes()
self._make_codes()
encoded_text = self.get_encoded_text(text)
padded_encoded_text = self.get_padded_encoded_text(encoded_text)
byte_array = self.get_byte_array(padded_encoded_text)
return byte_array
def remove_padding(self, padded_encoded_text):
"""Removes the padding added during compression."""
padding_info = padded_encoded_text[:8]
padding_amount = int(padding_info, 2)
# Remove the padding info and the padding itself
encoded_text = padded_encoded_text[8:-padding_amount]
return encoded_text
def decode_text(self, encoded_text):
"""Decodes the encoded text using the reverse mapping."""
current_code = ""
decoded_text = ""
for bit in encoded_text:
current_code += bit
if current_code in self.reverse_mapping:
char = self.reverse_mapping[current_code]
decoded_text += char
current_code = ""
return decoded_text
def decompress(self, byte_array):
"""Main function to decompress the byte array."""
# Convert byte array back to a binary string
bit_string = ""
for byte in byte_array:
bits = bin(byte)[2:].rjust(8, '0')
bit_string += bits
# Remove padding and decode
encoded_text = self.remove_padding(bit_string)
decompressed_text = self.decode_text(encoded_text)
return decompressed_text
Putting It All Together: A Full Example
Here's how you would use the HuffmanCoding class.
# --- Main execution ---
if __name__ == '__main__':
# The text to be compressed
text = "Huffman coding is a data compression algorithm."
print("Original Text:")
print(text)
print("-" * 30)
# --- Compression ---
huffman = HuffmanCoding()
compressed_data = huffman.compress(text)
print("Compressed Data (as a byte array):")
print(compressed_data)
print("-" * 30)
print("Huffman Codes:")
for char, code in huffman.codes.items():
print(f"'{char}': {code}")
print("-" * 30)
# --- Decompression ---
# To decompress, we need the same HuffmanCoding instance
# because it holds the reverse_mapping dictionary.
# In a real application, you would need to save this mapping
# along with the compressed data.
decompressed_text = huffman.decompress(compressed_data)
print("Decompressed Text:")
print(decompressed_text)
print("-" * 30)
# Verification
if text == decompressed_text:
print("Success! The original and decompressed texts match.")
else:
print("Error! The texts do not match.")
Important Considerations for a Real-World Application
The example above works perfectly, but to use Huffman coding for file compression, you need to handle one more thing: saving the Huffman tree (or codes) to the compressed file. Without it, you cannot decompress the data.
The common approach is to serialize the reverse_mapping dictionary (or the tree structure itself) and prepend it to the compressed byte array.
Here's a conceptual idea of how you might save and load the codes:
import pickle # For serializing Python objects
def save_to_file(filename, byte_array, reverse_mapping):
"""Saves the compressed data and the Huffman codes to a file."""
# Serialize the codes dictionary
serialized_codes = pickle.dumps(reverse_mapping)
# Write the length of the codes data first, then the codes, then the data
with open(filename, 'wb') as f:
# Write the length of the serialized codes
f.write(len(serialized_codes).to_bytes(4, 'big'))
# Write the serialized codes
f.write(serialized_codes)
# Write the compressed data
f.write(byte_array)
def load_from_file(filename):
"""Loads the compressed data and Huffman codes from a file."""
with open(filename, 'rb') as f:
# Read the length of the serialized codes
codes_length = int.from_bytes(f.read(4), 'big')
# Read the serialized codes
serialized_codes = f.read(codes_length)
# Deserialize the codes
reverse_mapping = pickle.loads(serialized_codes)
# Read the rest of the file as the compressed data
byte_array = f.read()
return byte_array, reverse_mapping
# You would then integrate this into your main program:
# 1. Compress text -> get byte_array and reverse_mapping
# 2. save_to_file('file.huf', byte_array, reverse_mapping)
#
# 3. To decompress:
# byte_array, reverse_mapping = load_from_file('file.huf')
# huffman = HuffmanCoding()
# huffman.reverse_mapping = reverse_mapping # Load the codes
# decompressed_text = huffman.decompress(byte_array) 