DEFLATE
"A Massively Spiffy Yet Delicately Unobtrusive Compression Library"
The core of zlib is the "deflate" algorithm, which applies:
- LZ77 Prefix string compression (using a variable dictionary size)
- Huffman coding (using a pre-defined tree for small files, else a dynamic tree)
Deflate stream structure
Deflate converts a sequence of uncompressed bytes (the data) into a sequence of structured blocks. We will call this sequence of blocks the "deflate stream"
Each block has a 3-bit header with two fields:
- BFINAL:
1if this block is last in the sequence, else0 - BTYPE: indicates compression type.
00: No compression. Any bits up to the next byte boundary are ignored. The rest of the block consists of a 2-byte length fieldLEN, a 2-byte field containing one's complement ofLEN, andLENbytes of uncompressed data.01: A static Huffman compressed block, using a tree defined in the RFC10: A dynamic Huffman compressed block, with the huffman table supplied immediately following the header
Prefix match compression
Consider a stream of text:
readlineslinesread
Notice that there's two duplicate sequences:
vvvv,,,,
readlinelineread
''''^^^^
0123456789abcdef
We can condense that down by saying:
readline{length: 4, distance: 4}{length: 4, distance: 12}
01234567 8 9 a b
Restated in plain english:
The literal character sequence 'readline', followed by two symbols indicating a reference
with length 4 beginning 4 positions prior, followed by another pair of symbols indicating
a reference of length 4 beginning 12 positions prior.
How do we achieve this? Let's make some decisions.
First, let's decide that for matches of 1 or 2 symbols, we'll just keep the original symbol.
Second, let's decide that we might look back 1-512 positions (9 bits) for sequences
between 3-258 (a range of 256, or 8 bits) symbols long. And, let's decide to use a stop-code
to indicate the end of this compressed stream.
One hypothetical approach to this could use 9 bits to represent chars or match lengths, followed
by a 9-bit "distance" field that is guaranteed to only follow match lengths:
0 0000 0000 :: Literal 0, ascii null
0 0111 1111 :: Literal 127, ascii DEL
0 1111 1111 :: Literal 256
1 0000 0000 :: Stop code
1 0000 0001 :: Match length 3
1 0000 1111 :: Match length 17
1 1111 1111 :: Match length 257
What does that look like for our earlier example?
0 0111 0010 :: 'r'
0 0110 0101 :: 'e'
0 0111 0001 :: 'a'
0 0111 0100 :: 'd'
0 0110 1100 :: 'l'
0 0110 1001 :: 'i'
0 0110 1110 :: 'n'
0 0110 0101 :: 'e'
1 0000 0010 :: Match length 4
0 0000 0011 :: Distance 4
1 0000 0010 :: Match length 4
0 0000 1011 :: Distance 12
Compared to before, that's 12 symbols x 9 width = 108 bits, which is not too much better than
the raw ascii representation of 16 bytes = 128 bits. However, the string:
. . . . . . .
readlinelinereadlinereadline (24 bytes, 192 bits)
Has the same compressed length as before:
readline{length 4, distance 4}{length: 16, distance 12} (12 symbols, 108 bits)
And we can cope with repeats as well:
. . . . . . . .
readlinelinelinelinelinelinereadline
Compreses to:
readline{length: 16, distance 4}{length 8, distance 28}
What about Deflate?
Deflate uses a combination of prefix matching, combined with huffman trees, to get further than the earlier 9-bit symbol format. It also allows for distances up to 32768 (15-bits) rather than our mere 512 bytes.
For literal symbol/distance combinations, we have 288 discrete bit combinations to play with: 0 0000 0000 :: 0, Literal value for null 0 0111 0010 :: 114, Literal value for 'r' 0 0100 0001 :: 101, Literal value for 'e' ... 1 0000 0000 :: 256, Stop code 1 0000 0001 :: 257, Match length of 3 1 0001 1101 :: 285, Match length of 258