- Ask the user if they want to compress of decompress a file.
- Get input file path.
- Get a path to save the new compressed or decompressed file.
- And then exit. Your program should not remain open between compression and decompression sessions.
This program is an implementation of the Huffman compression algorithm, which is used to compress and decompress text files. The algorithm works by constructing a frequency table of characters in the input file, then builds a binary tree to create a dictionary for comparing the contents to be compressed based on the frequency table. There are mainly two parts to the function, one is compression and another is decompression. The program in the compression includes BuildTree and BuildDictionary() and write to file; the part of the decompression includes retrieving the header to build the frequency table and using the table to build the tree to decode the compressed file.
In the compression, The program first calls the BuildFrequencyTable(const string& filePath), which reads the input file character by character and builds the frequency table by using the map and adds a PSEUDO_EOF in the frequency table to prepare for the decompression. The first time I did this part was using the vector to push_back the node that has been initialized to be the frequency table. But that method is inefficient, taking up a large memory, and reading and searching speed takes O(n) because it cannot use a binary search algorithm that needs to be sorted pattern. So I go searching for the more efficient container - the map. The map is built on the red-black tree which may take O(logn) to search and large memory usage to store the pointer. Then I plan to choose the unordered_map which is built on hash_map that the search may take O(1) in the best case. But after I changed to store the frequency table in the header I find if I use unordered_map the order is random so if I use the frequency table retrieving from the header to build the tree will not match the original version. So for multiple reasons listed ahead, I chose to use the map. After the frequency table, the BuildTree function creates a priority queue of nodes, where each node represents a character and its frequency in the input file. It will repeatedly dequeue the two nodes from the priority queue, creates a parent node for them, and enqueues the parent node back into the priority queue. This process is repeated until there is only one node left in the queue, which represents the root of the Huffman tree. Finally, it updates the treeRoot member variable to point to the root of the Huffman tree. Then the BuildDictionary(Node* tree, const string& prefix) function will build a dictionary of codes for each character in the input file based on the Huffman tree generated in the first part of the algorithm. It takes as input a pointer to the root of the Huffman tree and a prefix string. If the current node is a leaf node, it will add the prefix string and the corresponding character to the Dict member variable, which is a map that maps each character to its Huffman code. If the current node is not a leaf, it recursively calls the BuildDictionary function for its left child (adding a '0' to the prefix string) and for its right child (adding a '1' to the prefix string). After everything is constructed completely, it will be written to the file by using the storage class. The writing to file part first store the frequency table to the file and insert the content character by character compared to the content of map<char, string> Dict. In the set header part, my first time was using the delimiter "," to separate the letter and Huffman code. and tried to use the delimiter to initialize back to the node. Then I find the delimiter will conflict with the comma in the text file. Then I used "@" to split up as an alternative. But I realized that if the file contains the "@" that will break the rule of the game. Then I was inspired by the storage class, I converted the dictionary to Binary format and use ' ' and /n as the delimiter, perfectly solving the problem. Then the compress function will write the pre-arranged PSEUDO_EOF to the end of the file, which is a flag that marks the end of the file and will stop the decompression file from continuing to read the file, due to the file being written as the binary pattern 8 bits at a time, and the file is not likely to always proportional to the 8.
In the decompression part, the program reads the header file back and push to the map to build the frequency table and then build the tree again by using the frequency table. After the Huffman tree is built, the function reads the compressed message from the input file chunk by chunk. The chunks are concatenated to any remaining bits from the previous chunk that were not decoded into a character. The concatenated result is then decoded character by character by traversing the Huffman tree. The traversal of the Huffman tree is done by following a path in the tree based on the value of each bit in the compressed message. If the current bit is '0', it follows the left child of the current node; if it is '1', then follow the right child. If the traversal reaches a leaf node, the character associated with that leaf node is written to the output file. If the traversal reaches the PSEUDO_EOF character, meaning that the end of the compressed message has been reached, and the function stops reading from the file by the storage. The first version of my decompression is to store the reversed_map of the dictionary. But my dear professor Paul said it cannot be used in such a tricky way. Then I compromised to use this educational way to build the tree.
In the part of optimization, I changed the place that needs to concatenate the string to use string stream to handle, which may reduce the time to reallocate memory. I build the Big Three in C++ and use delete after the dynamically allocated variables that don't need to use anymore to handle the memory usage. Also, I tried to avoid nested loops to increase time efficiency. But in the decompression Part, it is excusable to sacrifice time complexity for memory efficiency. Additionally, I used const and & if it does not need to change the value to avoid duplicate copies.
To use this program, compile it and then run the executable. The user will be prompted to choose between compression (option 1) and decompression (option 2). Next, provide the input file path and the output file path. The program will perform the chosen operation and save the result to the specified output file.
The compression rate varies depending on the input file's content and the size of the original file. When compressing a large text file with more than 1MB of English text, the program compressed it to around 50% - 60% of its original size. In addition, the actual compression rate depends on the input file's content, since the Huffman algorithm is more efficient with files containing more repetitive or predictable data (the most repetitive character will have less length in building dictionary).
However, if the file is smaller than 1 MB, the compression effect is not ideal. Because it may take up more space to store the decode information, such as the file gogogophers.




