disast.rs :: journal

DEFLATE

"A Massively Spiffy Yet Delicately Unobtrusive Compression Library"

The core of zlib is the "deflate" algorithm, which applies:

  1. LZ77 Prefix string compression (using a variable dictionary size)
  2. Huffman coding (using a pre-defined tree for small files, else a dynamic tree)

Deflate stream structure

Deflate converts a sequence of uncompressed bytes (the data) into a sequence of structured blocks. We will call this sequence of blocks the "deflate stream"

Each block has a 3-bit header with two fields:

  1. BFINAL: 1 if this block is last in the sequence, else 0
  2. BTYPE: indicates compression type.
    1. 00: No compression. Any bits up to the next byte boundary are ignored. The rest of the block consists of a 2-byte length field LEN, a 2-byte field containing one's complement of LEN, and LEN bytes of uncompressed data.
    2. 01: A static Huffman compressed block, using a tree defined in the RFC
    3. 10: A dynamic Huffman compressed block, with the huffman table supplied immediately following the header

Prefix match compression

 
 Consider a stream of text:

readlineslinesread Notice that there's two duplicate sequences: vvvv,,,, readlinelineread ''''^^^^ 0123456789abcdef We can condense that down by saying: readline{length: 4, distance: 4}{length: 4, distance: 12} 01234567 8 9 a b Restated in plain english: The literal character sequence 'readline', followed by two symbols indicating a reference with length 4 beginning 4 positions prior, followed by another pair of symbols indicating a reference of length 4 beginning 12 positions prior. How do we achieve this? Let's make some decisions. First, let's decide that for matches of 1 or 2 symbols, we'll just keep the original symbol. Second, let's decide that we might look back 1-512 positions (9 bits) for sequences between 3-258 (a range of 256, or 8 bits) symbols long. And, let's decide to use a stop-code to indicate the end of this compressed stream. One hypothetical approach to this could use 9 bits to represent chars or match lengths, followed by a 9-bit "distance" field that is guaranteed to only follow match lengths: 0 0000 0000 :: Literal 0, ascii null 0 0111 1111 :: Literal 127, ascii DEL 0 1111 1111 :: Literal 256 1 0000 0000 :: Stop code 1 0000 0001 :: Match length 3 1 0000 1111 :: Match length 17 1 1111 1111 :: Match length 257 What does that look like for our earlier example? 0 0111 0010 :: 'r' 0 0110 0101 :: 'e' 0 0111 0001 :: 'a' 0 0111 0100 :: 'd' 0 0110 1100 :: 'l' 0 0110 1001 :: 'i' 0 0110 1110 :: 'n' 0 0110 0101 :: 'e' 1 0000 0010 :: Match length 4 0 0000 0011 :: Distance 4 1 0000 0010 :: Match length 4 0 0000 1011 :: Distance 12 Compared to before, that's 12 symbols x 9 width = 108 bits, which is not too much better than the raw ascii representation of 16 bytes = 128 bits. However, the string: . . . . . . . readlinelinereadlinereadline (24 bytes, 192 bits) Has the same compressed length as before: readline{length 4, distance 4}{length: 16, distance 12} (12 symbols, 108 bits) And we can cope with repeats as well: . . . . . . . . readlinelinelinelinelinelinereadline Compreses to: readline{length: 16, distance 4}{length 8, distance 28}

What about Deflate?

Deflate uses a combination of prefix matching, combined with huffman trees, to get further than the earlier 9-bit symbol format. It also allows for distances up to 32768 (15-bits) rather than our mere 512 bytes.

 
 For literal symbol/distance combinations, we have 288 discrete bit combinations to play with: 
 
 0 0000 0000 :: 0, Literal value for null 
 0 0111 0010 :: 114, Literal value for 'r' 
 0 0100 0001 :: 101, Literal value for 'e' 
 ... 
 1 0000 0000 :: 256, Stop code 
 1 0000 0001 :: 257, Match length of 3 
 1 0001 1101 :: 285, Match length of 258 
 

Links