LZ77 and LZ78 Facts for Kids

LZ77 and LZ78 are two clever ways to make computer files smaller. They are lossless data compression algorithms. This means they shrink files without losing any information.

These methods were created by two smart people, Abraham Lempel and Jacob Ziv. They published their ideas in 1977 and 1978. Because of this, the methods are called LZ77 and LZ78. Sometimes, people also call them LZ1 and LZ2.

Many other popular compression methods are based on LZ77 and LZ78. These include LZW, LZSS, and LZMA. These algorithms also help make common file types smaller. For example, they are used in GIF images and the DEFLATE algorithm. DEFLATE is used for PNG images and ZIP files.

Both LZ77 and LZ78 are like "dictionary coders." This means they build a kind of dictionary of repeated parts of the data. LZ77 uses a "sliding window" to find repeated parts. LZ78 builds a clear dictionary as it goes.

In 2004, these algorithms were named an IEEE Milestone. This is a special honor for important inventions. In 2021, Jacob Ziv received the IEEE Medal of Honor for his work on them.

How LZ77 Works to Shrink Files
How LZ78 Works to Shrink Files
- Expanding LZ78 Compressed Files
LZW: A Popular LZ78 Cousin
See also

How LZ77 Works to Shrink Files

LZ77 algorithms make files smaller by finding parts of the data that repeat. When they find a repeated part, they don't save it again. Instead, they save a short note that says, "This part is the same as that part you saw earlier."

This note is called a length-distance pair. It tells the computer two things:

Length: How many characters are repeated.
Distance: How far back in the file to look for the original characters.

Imagine you have the text "banana banana". Instead of writing "banana" twice, LZ77 might say: "banana" (first time), then "go back 7 characters and copy 6 characters." This saves space!

The Sliding Window Explained

To find these repeated parts, the computer needs to remember some of the most recent data. This remembered data is kept in a special area called a sliding window. Think of it like a small notepad that slides along the text. It only holds the last few kilobytes of information.

The size of this window can change. It might be 2 kilobytes, 4 kilobytes, or even 32 kilobytes. The bigger the window, the more data the computer can look back at to find matches. Both the computer that shrinks the file (encoder) and the computer that expands it (decoder) need to use this sliding window.

Sometimes, the "length" of a repeated part can be longer than the "distance" back to it. This might sound strange. For example, "go back 4 characters and copy 10 characters." How can you copy 10 if only 4 are there?

It works because the computer copies one character at a time. As it copies, the new characters become available to be copied again. This lets the algorithm repeat a small pattern many times. It's a clever way to do run-length encoding. This is useful for data with long strings of the same character, like "AAAAAAA".

Simple Steps for LZ77 Compression

Here's a simplified way to think about how LZ77 compresses data:

While there is still text to compress:

* Look for the longest repeated part of the text. This repeated part must start inside the "sliding window." * If a repeated part is found: * Note its distance (how far back it is). * Note its length (how many characters it has). * Note the character that comes right after this repeated part. * Else (if no repeated part is found): * Distance is 0. * Length is 0. * The character is just the very first character of the text you are looking at. * Output these three pieces of information: (distance, length, character). * Remove the characters you just processed (length + 1 characters) from the front of your "sliding window." * Add the same characters to the back of your "sliding window."

Repeat until all text is compressed.

Different Ways LZ77 is Used

Even though all LZ77 algorithms use the same basic idea, they can be set up differently. They might change how they save the length-distance pairs. They also decide how to handle "literals." Literals are raw data that isn't part of a repeated pattern.

Here are some examples:

The original LZ77 method from 1977 always output three values: length, distance, and the character that followed the match. If no match was found, the length was 0.
LZSS is an improved version. It uses a special 1-bit flag. This flag tells the computer if the next part of the data is a literal or a length-distance pair. It uses literals if a length-distance pair would make the file bigger.
The PalmDoc format uses two bytes for a length-distance pair. 11 bits are for distance, 3 bits for length, and 2 bits help the decoder know it's a pair.
Many games by Electronic Arts use a method where the first byte of a length-distance pair tells you how long the whole pair is (1 to 4 bytes).
As of 2008, DEFLATE is a very popular LZ77 method. It combines LZSS with another method called Huffman coding. This makes it very efficient.

How LZ78 Works to Shrink Files

The LZ78 algorithms compress data by building a dictionary of common sequences of characters. As the computer reads the data, it adds new sequences it sees to this dictionary. If it sees a sequence that's already in its dictionary, it just saves a reference to that dictionary entry instead of saving the whole sequence again.

This dictionary is like a tree structure. Each entry in the dictionary points to an earlier entry and adds one new character. This makes a unique sequence. The algorithm is "greedy," meaning it only adds something new to the dictionary when it finds a character that doesn't match an existing sequence.

Let's look at an example with the text "AABBA": 1. The computer starts. Dictionary is empty. 2. It sees "A". "A" is new. It adds "A" to the dictionary (let's say entry #1). It outputs "0A" (meaning "nothing before, then A"). 3. It sees "AA". "A" is in the dictionary. It then sees "B". "AB" is new. It adds "AB" to the dictionary (entry #2). It outputs "1B" (meaning "entry #1 (A), then B"). 4. It sees "B". "B" is new. It adds "B" to the dictionary (entry #3). It outputs "0B" (meaning "nothing before, then B"). 5. The text ends.

So, for "AABBA", the dictionary might look like this:

Entry 1: A
Entry 2: AB
Entry 3: B

And the compressed output would be something like "0A1B0B". Notice that the last "A" in "AABBA" isn't directly represented. Usually, an "end of file" marker is added to make sure everything is processed.

Expanding LZ78 Compressed Files

To decompress an LZ78 file, the computer rebuilds the dictionary as it reads the compressed data. If the compressed data is "0A1B0B": 1. It reads "0A". It knows "0" means nothing before, so it adds "A" to its dictionary (entry #1) and outputs "A". 2. It reads "1B". It looks up entry #1 in its dictionary, which is "A". Then it adds "B". So, it forms "AB". It adds "AB" to its dictionary (entry #2) and outputs "AB". 3. It reads "0B". It knows "0" means nothing before, so it adds "B" to its dictionary (entry #3) and outputs "B".

By doing this, the original text "AABBA" is perfectly recreated!

LZW: A Popular LZ78 Cousin

LZW is a very famous algorithm based on LZ78. It's used in many places, like for GIF images.

The main difference with LZW is that its dictionary starts with all possible single characters already in it. This means the computer always has a starting point. When LZW doesn't find a match, it assumes the current character is the start of a new sequence. It then just outputs the index of the last matching sequence it found.

For more details, you can check out the LZW article.

BTLZ is another LZ78-based algorithm. It was made for real-time communication, like modems. When its dictionary gets full, it has a simple way to make space. It finds an old, unused entry and deletes it to make room for a new one. This helps the dictionary keep up with changing data.

LZ77 and LZ78 facts for kids

Contents