Your Computer Science Resource

PROJECT: Huffman Coding

In the field of computer science, data compression is an extremely important tool used in many different types of applications. One way to compress data is to convert the data to binary code, using a shorter code for more frequently used values and a longer code for less frequently used values. Doing so allows you to save larger values in a much smaller space, thus saving you space overall. This is essentially how Huffman Coding works, and how I have implemented data compression in this application. To download the complete source to this project in C++, click here.

In this specific implementation, we will be reading an input file (to compress), identify the set of characters appearing in the input string and their frequency, and generate a Huffman Code for these characters. We will be using two queues to generate the codes. Once the codes are computed (and thus, the binary tree is finished), we can encode and decode messages. For the purpose of this project, we will encode the input, and then decode the reversed input (this will be gibberish, but that’s OK — these are the instructions, so we must do it.).

What is Huffman Coding?

  • Huffman coding is an entropy encoding algorithm used for lossless data compression.
  • It was developed by David A. Huffman while he was a Ph.D. student at MIT.
  • Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol).

What does this really mean?

In the English language, certain letters appear more often than others. For example, the letter “E” is very common in many words. Now, suppose I needed to compress a document. If I converted all of the letters to binary, they might all be the same length. But if I considered the frequency of the letters, I could compress the document in such a way that the most frequent characters have shorter bit lengths, and the less frequent characters have longer bit lengths, so that “E” is shorter than the letter “X”.

Technique

(For the purpose of our project; your technique may vary).

  1. Read an input and count the character frequencies. When finished, store all of the characters and their respective frequencies in nodes.
  2. Create two queues.
  3. Enqueue all nodes into the first queue by increasing order (this essentially puts the least likely node at the beginning of the queue).
  4. While there are nodes in the first and second queue:
    1. Dequeue the two nodes with the lowest weight (usually the front of the queue(s)).
    2. Create a new node with the two removed nodes. Make this node the parent of the two removed nodes, and the sum of their weights as the value for this parent. Note that the left or right property for the children does not matter (the nodes can be in any order).
    3. Enqueue this parent node into the rear of the second queue.
  5. The last remaining node is the root of the Huffman tree.
  6. Traversing the tree will generate the code for you. Each time you go left, add a 0. Each time you go right, add a 1.

Now for the code…

Node

The first structure we need is a node. This is sort of an all-purpose node which I used in the queue (linked list) and the binary tree, so there are links to the left and right, next and previous, and the parent and child, depending on what the structure is being used for. This is Node.cpp:

struct node {

	int frequency;
	char value;
	struct node * parent;
	struct node * prev;
	struct node * next;
	struct node * lchild;
	struct node * rchild;

	node() {

		frequency = 0;
		value = 0;
		parent = 0;
		prev = 0;
		next = 0;
		lchild = 0;
		rchild = 0;

	}

};

Queue

Next, we have our Queue, which is used to sum the node values and generate the nodes of the binary tree. This structure includes some basic functionalities. This is Queue.h:

#include "Node.cpp"
#include <string>
#include <iostream>
#include <sstream>

using namespace std;

#ifndef QUEUE_H_
#define QUEUE_H_

class Queue
{
public:
	Queue();
	virtual ~Queue();
	struct node * head;
	struct node * tail;
	struct node * peek();
	void push( struct node * toQueue );
	void pushFirst( struct node * toQueue );
	struct node * pop();
	bool isEmpty();
	int size();
	string toString();
};

#endif /*QUEUE_H_*/

And the source file, Queue.cpp:

#include "Queue.h"

Queue::Queue()
{
	// Initialize head and tail

	head = new struct node;
	tail = new struct node;

	// Parent

	head->parent = NULL;
	tail->parent = NULL;

	// Children

	head->lchild = NULL;
	tail->lchild = NULL;

	head->rchild = NULL;
	tail->rchild = NULL;

	// Next

	head->next = tail;
	tail->next = NULL;

	// Prev

	head->prev = NULL;
	tail->prev = head;

	// Frequency

	head->frequency = -1;
	tail->frequency = -1;

	// Value

	head->value = '\0';
	tail->value = '\0';

}

Queue::~Queue()
{
	head = NULL;
	tail = NULL;
	delete head;
	delete tail;
}

struct node * Queue::peek()
{

	struct node * toPeek = head->next;
	return toPeek;

}

void Queue::push( struct node * toQueue )
{

	/**
	 * List is empty, so push this node
	 * to the first position in the
	 * queue.
	 */

	if ( isEmpty() ) {

		pushFirst( toQueue );

	} 

	/**
	 * List is not empty, so push this node
	 * to the last position before the
	 * tail.
	 */

	else {

		struct node * current = head->next;

		while ( current != tail ) {

			if ( current->next == tail ) {

				current->next = toQueue;
				toQueue->next = tail;		

			}

			current = current->next;

		} // end while loop

	} // end if

}

void Queue::pushFirst( struct node * toQueue )
{

	struct node * cursor = head->next;
	head->next = toQueue;
	toQueue->next = cursor;

}

struct node * Queue::pop()
{

	if ( !isEmpty() ) {

		struct node * toReturn = head->next; // Node to return

		/**
		 * The node we are dequeuing has a node
		 * after it. Let's make this head->next.
		 */

		if ( toReturn->next != tail ) {

			head->next = head->next->next;

		}

		/**
		 * There is no node after this one, meaning
		 * the queue is empty now. Let's make
		 * head->next the tail.
		 */

		else {

			head->next = tail;

		}

		return toReturn;

	}

	return NULL;

}

bool Queue::isEmpty()
{
	return size() == 0;
}

int Queue::size()
{

	// List is empty

	if ( head->next == tail ) return 0;

	// List isn't empty

	struct node * current = head->next;
	int length = 0;

	while ( current != tail ) {

		length++;
		current = current->next;

	}

	return length;
}

string Queue::toString()
{

	// List is empty

	if ( isEmpty() ) {

		return "[HEAD] - [TAIL]";

	}

	// List isn't empty

	string tostring = "[HEAD]";

	struct node * current = head->next;

	while ( current != tail ) {

		ostringstream o;
		o << current->frequency;
		tostring = tostring + " - [" + o.str() + ":" + current->value + "]";
		current = current->next;

	}

	tostring = tostring + " - [TAIL]";

	return tostring;
}

The Main Program

This code is a bit lengthy, so please bear with me. The main method is at the bottom of the code, and we are reading the file, parsing the document, and creating the Huffman code. The majority of the work is in the method huffman(). You will see that I’ve left many comments within the code, and some debugging lines are commented out. Should you encounter any difficulties, uncomment these lines and you will see more output in the console. To test this out with your own input, create a file called input.txt and save it in the root of this project. Run the program, and you should see your input file, the Huffman encoded message, the decoded message (which should be the same as the original input), and finally, the reverse encoded and decoded message.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>
#include "Queue.h"

using namespace std;

int oldBits = 0;			// Number of Bits in Original Message
int newBits = 0;			// Number of Bits in Compressed Message
Queue * q1 = new Queue;			// External Nodes (Q1)
Queue * q2 = new Queue;			// Internal Nodes (Q2)
const int MAX_CHARS = 128;		// Number of ASCII chars
int frequency[ MAX_CHARS ];		// Array of int frequencies (for ASCII chars)
vector<string> text_file;		// Vector to store text file lines
vector<string> code_vector;		// Vector to store the code
ifstream ifs("input.txt");		// Text file to read

void readFile();
void parse();
void huffman();
string reverseString( string toReverse );
void decode( struct node * originalRoot, struct node * root, string code, unsigned int index );
void encode( struct node * root, string code, char value );
void print_inorder( struct node * root, string code );
void search( struct node * root, string code, char value );
void sortNodes( struct node array[], int left, int right );
void mergeNodes( struct node array[], int left, int middle, int right );
void sort( int array[], int left, int right );
void merge( int array[], int left, int middle, int right );
int main();

/**
 * This merge sorts the given array.
 *
 * @param int array[]
 * @param int left
 * @param int right
 */

void sort( int array[], int left, int right ) {

	if ( left >= right ) return;			// Only one element

	int middle = ( left + right ) / 2;		// Middle index
	sort( array, left, middle );			// Sort left side
	sort( array, middle + 1, right );		// Sort right side
	merge( array, left, middle, right );		// Merge the two

} // end method sort

/**
 * This merge sorts nodes in a
 * given array.
 *
 * @param int array[]
 * @param int left
 * @param int right
 */

void sortNodes( struct node array[], int left, int right ) {

	if ( left >= right ) return;				// Only one element

	int middle = ( left + right ) / 2;			// Middle index
	sortNodes( array, left, middle );			// Sort left side
	sortNodes( array, middle + 1, right );			// Sort right side
	mergeNodes( array, left, middle, right );		// Merge the two

} // end method sortNodes

/**
 * This merges the sorted data in the
 * array (it essentially also sorts it).
 *
 * @param int array[]
 * @param int left
 * @param int middle
 * @param int right
 */

void merge( int array[], int left, int middle, int right ) {

	if ( middle + 1 > right ) return; // List is empty

	int tempArray[ MAX_CHARS ];	// Initialize a temporary array

	/**
	 * Copy left side of array
	 * into temporary array
	 */

	for ( int i = left; i != middle + 1; i++ ) {

		tempArray[i] = array[i];

	} // end for loop

	/**
	 * Copy right side of array
	 * into temporary array
	 */

	for ( int i = middle + 1; i != right + 1; i++ ) {

		tempArray[i] = array[ right + middle + 1 - i ];

	} // end for loop

	/**
	 * Merge the two
	 */

	int l = left;	// Left side
	int r = right;	// Right side

	for ( int i = left; i != right + 1; i++ ) {

		/**
		 * Left <= Right, so copy Left
		 * into array
		 */

		if ( tempArray[l] <= tempArray[r] ) {

			array[i] = tempArray[l++];

		} 

		/**
		 * Copy Right into array
		 */

		else {

			array[i] = tempArray[r--];

		} // end if

	} // end for loop

	array = tempArray;	// Copy back into array

} // end method merge

/**
 * This merges the sorted data in the
 * array (it essentially also sorts it).
 *
 * @param int array[]
 * @param int left
 * @param int middle
 * @param int right
 */

void mergeNodes( struct node array[], int left, int middle, int right ) {

	if ( middle + 1 > right ) return; // List is empty

	struct node tempArray[ MAX_CHARS ];	// Initialize a temporary array

	/**
	 * Copy left side of array
	 * into temporary array
	 */

	for ( int i = left; i != middle + 1; i++ ) {

		tempArray[i] = array[i];

	} // end for loop

	/**
	 * Copy right side of array
	 * into temporary array
	 */

	for ( int i = middle + 1; i != right + 1; i++ ) {

		tempArray[i] = array[ right + middle + 1 - i ];

	} // end for loop

	/**
	 * Merge the two
	 */

	int l = left;	// Left side
	int r = right;	// Right side

	for ( int i = left; i != right + 1; i++ ) {

		/**
		 * Left <= Right, so copy Left
		 * into array
		 */

		if ( tempArray[l].frequency <= tempArray[r].frequency ) {

			array[i] = tempArray[l++];

		} 

		/**
		 * Copy Right into array
		 */

		else {

			array[i] = tempArray[r--];

		} // end if

	} // end for loop

	array = tempArray;	// Copy back into array

} // end method merge

/**
 * This reads a given text file and puts each
 * line in a vector.
 */

void readFile() {

	string line;

	cout << "---------------------------------------------------------------" << endl;
	cout << "Input File" << endl;
	cout << "---------------------------------------------------------------\n" << endl;

	while(getline(ifs,line)) {

		cout << line << endl;
		text_file.push_back( line );
		oldBits = oldBits + line.length();

	}

	cout << endl;

	oldBits = oldBits * 8;	// Assume every char is 8 bits

} // end method readFile

/**
 * This parses the input file, counts the frequency
 * of chars, and creates an array of those
 * frequencies.
 */

void parse() {

	// Initialize frequency array

	for( int i = 0; i < MAX_CHARS; i++ ) {

		frequency[ i ] = 0;

	}

	// Get ASCII Value for each char and increment its frequency
	vector<string>::iterator itVectorData;

	for( itVectorData = text_file.begin(); itVectorData != text_file.end(); itVectorData++ ) {

		string * toSearch = new string;
		*toSearch = *(itVectorData);

		for( unsigned int i = 0; i < toSearch->size(); i++ ) {

			char toChar = toSearch->at(i);
			int ascii = (int)toChar;

			// Only increment frequencies that are within range

			if( ascii >= 0 && ascii <= MAX_CHARS ) {

				frequency[ ascii ]++;

			}

		} // end for loop
	}

	// Output elements that have a frequency higher than zero

	int sum = 0;
	int count = 0;

	for( int i = 0; i < MAX_CHARS; i++ ) {

		if( frequency[ i ] > 0 ) {

			//cout << char(i) << " appears " << frequency[ i ] << " times." << endl;
			sum = sum + frequency[ i ];
			count++;

		}

	} // end for loop

	// Now make an array of elements > 0

	struct node tempArray2[ count ];
	int tempArray[ count ];
	int index = 0;

	for( int i = 0; i < MAX_CHARS; i++ ) {

		if( frequency[ i ] > 0 ) {

			struct node * n = new struct node;
			n->frequency = frequency[ i ];
			n->value = i;
			tempArray2[ index ] = *n;
			tempArray[ index ] = frequency[ i ];
			index++;

		}

	}

	// Sort the array
	sort( tempArray, 0, count-1 );
	sortNodes( tempArray2, 0, count-1 );

	// Put into queue

	for ( int i = 0; i < count; i++ ) {

		struct node * toPush = new struct node;
		toPush->frequency = tempArray2[ i ].frequency;
		toPush->value = tempArray2[ i ].value;

		q1->push( toPush );

	}

} // end method parse

/**
 * This method essentially builds the
 * huffman tree so we can encode and
 * decode data.
 */

void huffman() {

	while ( q1->size() > 0 ) {

		//cout << endl;

		struct node * n1 = new struct node;
		struct node * n2 = new struct node;
		struct node * toPush = new struct node;

		// Second queue is empty, so take first two
		// nodes from the first queue. No comparison
		// is necessary.

		if ( q2->size() == 0 ) {

			n1 = q1->pop();
			n2 = q1->pop();

		}

		// Second queue has node(s), so we must
		// compare to the first queue.

		else {

			/**
			 * If the size of both the queues are 1, then
			 * just pop them both.
			 */

			if ( q1->size() == 1 && q2->size() > 0 ) {

				n1 = q1->pop();
				n2 = q2->pop();

			} 

			/**
			 * Otherwise, we have to compare both of
			 * the nodes from Q1 and Q2.
			 */

			else {

				// Peek both nodes from Q1 and Q2

				struct node * temp1 = q1->peek();
				struct node * temp2 = q2->peek();

				// Q2 node is smaller (choose it)

				if ( temp1->frequency > temp2->frequency ) {

					n1 = q2->pop();

				}

				// Q1 node is smaller (choose it)

				else {

					n1 = q1->pop();

				}

				// Do this again for n2, but this
				// time we must check if Q1 is
				// empty.

				if ( !q1->isEmpty() ) {

					temp1 = q1->peek();
					temp2 = q2->peek();

					// Q2 node is smaller (choose it)

					if ( temp1->frequency > temp2->frequency ) {

						n2 = q2->pop();

					}

					// Q1 node is smaller (choose it)

					else {

						n2 = q1->pop();

					}

				} 

				// Q1 has been emptied, so get the next two
				// nodes from Q2 instead.

				else {

					// Only take two nodes from Q2 if
					// we have two nodes to take

					if ( q2->size() > 1 ) {

						n1 = q2->pop();
						n2 = q2->pop();

					} 

					// Q2 has one node left, so break.

					else {

						break;

					}

				}

			}

		} // end if

		toPush->frequency = n1->frequency + n2->frequency;	// Sum of frequencies
		toPush->lchild = n1;								// Left child
		toPush->rchild = n2;								// Right child
		toPush->value = '\0';								// Char value
		n1->parent = toPush;								// Parent of N1
		n2->parent = toPush;								// Parent of N2

		// Push sum of these nodes into Q2

		q2->push( toPush );

		//cout << "N1: " << n1->frequency << " N2: " << n2->frequency << " toPush: " << toPush->frequency << endl;

		//cout << q1->toString() << endl;
		//cout << q2->toString() << endl;

	} // end while loop

	while ( q2->size() > 1 ) {

		//cout << endl;

		struct node * n1 = new struct node;
		struct node * n2 = new struct node;

		n1 = q2->pop();
		n2 = q2->pop();

		struct node * toPush = new struct node;
		toPush->frequency = n1->frequency + n2->frequency;	// Sum of frequencies
		toPush->lchild = n1;								// Left child
		toPush->rchild = n2;								// Right child
		toPush->value = '\0';								// Char value
		n1->parent = toPush;								// Parent of N1
		n2->parent = toPush;								// Parent of N2

		// Push sum of these nodes into Q2

		q2->push( toPush );

		//cout << "N1: " << n1->frequency << " N2: " << n2->frequency << " toPush: " << toPush->frequency << endl;

		//cout << q1->toString() << endl;
		//cout << q2->toString() << endl;

	}

	// Make the last node the root

	struct node * root = new struct node;

	if ( q2->size() == 1 ) {

		root = q2->pop();

		//cout << "\nRoot: " << root->frequency << endl;

	}

	//cout << q1->toString() << endl;
	//cout << q2->toString() << endl;

	/**
	 * Encode the string
	 */

	cout << "---------------------------------------------------------------" << endl;
	cout << "Encoding..." << endl;
	cout << "---------------------------------------------------------------\n" << endl;

	cout << "INPUT:\t";

	vector<string>::iterator itVectorData;

	// Print out what we are encoding

	for( itVectorData = text_file.begin(); itVectorData != text_file.end(); itVectorData++ ) {

		string * toSearch = new string;
		*toSearch = *(itVectorData);

		cout << *toSearch;

	}

	cout << "\n\nOUTPUT:\t";

	// Encode it

	for( itVectorData = text_file.begin(); itVectorData != text_file.end(); itVectorData++ ) {

		string * toSearch = new string;
		*toSearch = *(itVectorData);

		for( unsigned int i = 0; i < toSearch->size(); i++ ) {

			encode( root, "", toSearch->at(i) );

		}

	}

	// Print the code

	string thecode;

	for( itVectorData = code_vector.begin(); itVectorData != code_vector.end(); itVectorData++ ) {

		string * code = new string;
		*code = *(itVectorData);

		//cout << *code;

		thecode = thecode + *code;

	}

	newBits = thecode.length();

	cout << thecode << endl;

	cout << "\n---------------------------------------------------------------" << endl;
	cout << "Huffman Codes" << endl;
	cout << "---------------------------------------------------------------\n" << endl;
	print_inorder( root, "" );
	cout << endl;

	/**
	 * Compression Ratio
	 */

	cout << "---------------------------------------------------------------" << endl;
	cout << "Compression Ratio" << endl;
	cout << "---------------------------------------------------------------\n" << endl;
	cout << "Number of Bits in Uncompressed Message:\t" << oldBits << endl;
	cout << "Number of Bits in Compressed Message:\t" << newBits << endl;

	float saved = oldBits - newBits;
	cout << "Number of Bits Saved:\t\t\t" << saved << endl;

	float ratio = ( saved / oldBits );
	cout << "Compression Ratio (value):\t\t" << ratio << endl;

	float percentage = ratio * 100;
	cout << "Compression Ratio (percent):\t\t" << percentage << " %" << endl;

	cout << endl;	

	/**

	 * Decode
	 */

	cout << "---------------------------------------------------------------" << endl;
	cout << "Decoding the Encoded Message..." << endl;
	cout << "---------------------------------------------------------------\n" << endl;

	cout << "INPUT:\t";
	cout << thecode << endl;

	cout << "\nOUTPUT:\t";
	decode( root, root, thecode, 0);

	/**
	 * Decode the reverse (this will be
	 * gibberish)
	 */

	cout << "\n\n---------------------------------------------------------------" << endl;
	cout << "Decoding the Reverse Encoded Message..." << endl;
	cout << "---------------------------------------------------------------\n" << endl;

	string reverse = reverseString( thecode );
	cout << "INPUT:\t" << reverse << endl;
	cout << "\nOUTPUT:\t";
	decode( root, root, reverse, 0 );

} // end method huffman

/**
 * This method prints the values of the
 * tree elements inorder.
 */

void print_inorder( struct node * root, string code ) {

	if ( root != NULL ) {

		print_inorder( root->lchild, code + "0" );	// print left subtree

		if ( root->value != '\0' ) {

			cout << "Character: " << root->value << "\tFrequency: " << root->frequency << "\tCode: " << code << endl;	// print this node

		}

		print_inorder( root->rchild, code + "1" );	// print right subtree

	}

} // end method print_inorder

/**
 * This method searches for a given element (char) in
 * the huffman tree.
 */

void search( struct node * root, string code, char value ) {

	if ( root != NULL ) {

		search( root->lchild, code + "0", value );	// search left subtree

		if ( root->value != '\0' ) {

			if ( root->value == value ) {

				cout << "Val: " << root->value << " Code: " << code << endl;	// print this node

			}

		}

		search( root->rchild, code + "1", value );	// search right subtree

	}

} // end method search

/**
 * This method encodes the data in the vector
 * and prints the huffman code.
 */

void encode( struct node * root, string code, char value ) {

	if ( root != NULL ) {

		encode( root->lchild, code + "0", value );	// print left subtree

		if ( root->value != '\0' ) {

			if ( root->value == value ) {

				//cout << code;					// print this node
				code_vector.push_back( code );	// store in vector

			}

		}

		encode( root->rchild, code + "1", value );	// print right subtree

	}

} // end method encode

/**
 * This method decodes a given code and gives the
 * string representation.
 */

void decode( struct node * originalRoot, struct node * root, string code, unsigned int index ) {

	// This will break out of the decode if
	// the last few chars don't exist

	if ( index >= code.length() ) return;

	// Only do this if the root isn't null

	if ( root != NULL ) {

		// We've decoded the char

		if ( root->value != '\0' ) {

			cout << root->value << flush;

			if ( code.length() != index ) {

				string c = code.substr( index, code.length() );
				decode( originalRoot, originalRoot, c, 0 );

			}

		}

		// Traverse

		else {

			// Go left

			if ( code.at( index ) == '0' ) {

				decode( originalRoot, root->lchild, code, ++index );

			}

			// Go right

			else {

				decode( originalRoot, root->rchild, code, ++index );

			}

		} // end if

	} // end if

} // end method decode

string reverseString( string toReverse ) {

	string s( toReverse.begin(), toReverse.end() );

	// iterate through all of the characters

	string::iterator pos;

	for (pos = s.begin(); pos != s.end(); ++pos) {

		//cout << *pos;

	}

	//cout << endl;

	reverse( s.begin(), s.end() );

	return s;

}

/**
 * Main method
 */

int main() {

	readFile();
	parse();
	huffman();

	return 0;

}

Questions, comments, suggestions? Leave them in the comments.

Tags: , ,

Leave a Reply