Letās take a look at the compression algorithm behind UnixāsĀ compressĀ and most .gif compression.
The Lempel Ziv Welch [LZW] algorithm is a greedy lossless compression algorithm that works by replacing recurring patterns with shorter codes in order to save space.
Lossless is a form of compression where no data is lost.
Weāll go over the algorithm and take a look at an implementation in Python.
For simplicity's sake, weāll limit the discussion to encoding ASCII characters.
Encoding
Essentially, if as weāre processing a text file, weāre able to identify recurring phrases, we can substitute those phrases with a code that would be shorter than the original text. Doing this over the entire document gives us the compression weāre after.
LZWĀ relies on a dictionary that maps a phrase to a unique code. The first 256 entries in the dictionary are the single ASCII characters mapped to their numeric code.
For example, at the start, the dictionary might look something like this:
{
"A": "65",
"B": "66",
....
"a": "97",
"b": "98"
}
Now, weāll iterate through the rest of the text ā character by character ā building up a phrase / character sequence. If the inclusion of a new character creates a phrase that doesnāt currently exist in our dictionary, we will ignore this breaking character and output the code that matches the previous state of our character sequence.
Then, weāll add this new phrase with this breaking character into our dictionary and assign it a new code.
For example, letās say our input text is:
... app [cursor] appl ....
Our dictionary might look something like this:
{
...
"app": 324,
....
}
The next portion of text we come across is āapplā which isnāt currently in our dictionary.
So, weāll output the code for the existing phrase match of āappā and then weāll add āapplā to our dictionary with a new code.
The output after this iteration might look something like this:
Dictionary: { ā¦ āappā: 324, āapplā: 325 ā¦}
Output: ā¦ 324 ā¦
Notice that we only output a code when we encounter a new phrase that isnāt already in our dictionary. This gives us a chance to build up to larger phrases and achieve greater compression.
As the program processes the input and the dictionary grows, weāll eventually be storing larger and larger strings. The longer the string, the more compression we can get by replacing it with a smaller code instead.
Unlike Huffman Encoding, LZW doesnāt require this dictionary to be included in the compressed data. One of the unique characteristics of this algorithm is that this dictionary can be rebuilt while uncompressing the data. Additionally, given the algorithmās character by character approach, itās well-suited for compressing streams of data.
To fully understand where the compression comes from, letās take a look at a larger example:
34832 : 'Harry'
34833 : 'Potter'
34834 : 'Hogwarts'
34835 : 'magic'
If we save the codeĀ 34833Ā as an unsigned short, for example, itāll costĀ 16 bitsĀ . However, itāll let us representĀ 6 characters * 8 bits = 48 bitsĀ worth of information.
Generally speaking, the higher the codes climb the more compression we will obtain.
Decoding
The decoding process is really just the opposite of the encoding process. The only change is that we need to re-create the dictionary with the 256 ASCII codes as a preliminary step before the rest of the decompression can begin.
Then, we read through the compressed data and use our code dictionary to translate the codes back to the original text.
Variable-Width Encoding
Most implementations of this algorithm use the idea of aĀ variable width code.
This defines an upper-bound for the maximum length of the code ā thereby preventing this algorithm from running away with memory. For example, if we chose a 8-bit code as our width, our maximum number of codes would be 256 (2āø).
If we chose a 12-bit code as our width, we could have up to 4096 codes and so on.
If our dictionary grew to include all 4096 codes, the rest of the document is encoded using the existing dictionary, but no new codes are added.
Code
The compression algorithm is fairly simple to implement.
The following implementation when run onĀ A Tale of Two Cities, results in a compressed file that is 532 KB which is 32% smaller than the original at 777 KB.
The implementation is an adapted version ofĀ this code:
Encoder
Decoder
Wrapping Up
This algorithm, though easy enough to understand and implement, isnāt always the right choice. It performs poorly when the text is short and unique; the algorithm favors redundant data.
If you like this technical dive into modern algorithms, feel free to follow me on Medium,Ā Twitter, or check out my personal site for more posts.
Also published here0