1 |
|
---|
2 |
|
---|
3 |
|
---|
4 |
|
---|
5 |
|
---|
6 |
|
---|
7 | Network Working Group P. Deutsch
|
---|
8 | Request for Comments: 1951 Aladdin Enterprises
|
---|
9 | Category: Informational May 1996
|
---|
10 |
|
---|
11 |
|
---|
12 | DEFLATE Compressed Data Format Specification version 1.3
|
---|
13 |
|
---|
14 | Status of This Memo
|
---|
15 |
|
---|
16 | This memo provides information for the Internet community. This memo
|
---|
17 | does not specify an Internet standard of any kind. Distribution of
|
---|
18 | this memo is unlimited.
|
---|
19 |
|
---|
20 | IESG Note:
|
---|
21 |
|
---|
22 | The IESG takes no position on the validity of any Intellectual
|
---|
23 | Property Rights statements contained in this document.
|
---|
24 |
|
---|
25 | Notices
|
---|
26 |
|
---|
27 | Copyright (c) 1996 L. Peter Deutsch
|
---|
28 |
|
---|
29 | Permission is granted to copy and distribute this document for any
|
---|
30 | purpose and without charge, including translations into other
|
---|
31 | languages and incorporation into compilations, provided that the
|
---|
32 | copyright notice and this notice are preserved, and that any
|
---|
33 | substantive changes or deletions from the original are clearly
|
---|
34 | marked.
|
---|
35 |
|
---|
36 | A pointer to the latest version of this and related documentation in
|
---|
37 | HTML format can be found at the URL
|
---|
38 | <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.
|
---|
39 |
|
---|
40 | Abstract
|
---|
41 |
|
---|
42 | This specification defines a lossless compressed data format that
|
---|
43 | compresses data using a combination of the LZ77 algorithm and Huffman
|
---|
44 | coding, with efficiency comparable to the best currently available
|
---|
45 | general-purpose compression methods. The data can be produced or
|
---|
46 | consumed, even for an arbitrarily long sequentially presented input
|
---|
47 | data stream, using only an a priori bounded amount of intermediate
|
---|
48 | storage. The format can be implemented readily in a manner not
|
---|
49 | covered by patents.
|
---|
50 |
|
---|
51 |
|
---|
52 |
|
---|
53 |
|
---|
54 |
|
---|
55 |
|
---|
56 |
|
---|
57 |
|
---|
58 | Deutsch Informational [Page 1]
|
---|
59 | |
---|
60 |
|
---|
61 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
62 |
|
---|
63 |
|
---|
64 | Table of Contents
|
---|
65 |
|
---|
66 | 1. Introduction ................................................... 2
|
---|
67 | 1.1. Purpose ................................................... 2
|
---|
68 | 1.2. Intended audience ......................................... 3
|
---|
69 | 1.3. Scope ..................................................... 3
|
---|
70 | 1.4. Compliance ................................................ 3
|
---|
71 | 1.5. Definitions of terms and conventions used ................ 3
|
---|
72 | 1.6. Changes from previous versions ............................ 4
|
---|
73 | 2. Compressed representation overview ............................. 4
|
---|
74 | 3. Detailed specification ......................................... 5
|
---|
75 | 3.1. Overall conventions ....................................... 5
|
---|
76 | 3.1.1. Packing into bytes .................................. 5
|
---|
77 | 3.2. Compressed block format ................................... 6
|
---|
78 | 3.2.1. Synopsis of prefix and Huffman coding ............... 6
|
---|
79 | 3.2.2. Use of Huffman coding in the "deflate" format ....... 7
|
---|
80 | 3.2.3. Details of block format ............................. 9
|
---|
81 | 3.2.4. Non-compressed blocks (BTYPE=00) ................... 11
|
---|
82 | 3.2.5. Compressed blocks (length and distance codes) ...... 11
|
---|
83 | 3.2.6. Compression with fixed Huffman codes (BTYPE=01) .... 12
|
---|
84 | 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) .. 13
|
---|
85 | 3.3. Compliance ............................................... 14
|
---|
86 | 4. Compression algorithm details ................................. 14
|
---|
87 | 5. References .................................................... 16
|
---|
88 | 6. Security Considerations ....................................... 16
|
---|
89 | 7. Source code ................................................... 16
|
---|
90 | 8. Acknowledgements .............................................. 16
|
---|
91 | 9. Author's Address .............................................. 17
|
---|
92 |
|
---|
93 | 1. Introduction
|
---|
94 |
|
---|
95 | 1.1. Purpose
|
---|
96 |
|
---|
97 | The purpose of this specification is to define a lossless
|
---|
98 | compressed data format that:
|
---|
99 | * Is independent of CPU type, operating system, file system,
|
---|
100 | and character set, and hence can be used for interchange;
|
---|
101 | * Can be produced or consumed, even for an arbitrarily long
|
---|
102 | sequentially presented input data stream, using only an a
|
---|
103 | priori bounded amount of intermediate storage, and hence
|
---|
104 | can be used in data communications or similar structures
|
---|
105 | such as Unix filters;
|
---|
106 | * Compresses data with efficiency comparable to the best
|
---|
107 | currently available general-purpose compression methods,
|
---|
108 | and in particular considerably better than the "compress"
|
---|
109 | program;
|
---|
110 | * Can be implemented readily in a manner not covered by
|
---|
111 | patents, and hence can be practiced freely;
|
---|
112 |
|
---|
113 |
|
---|
114 |
|
---|
115 | Deutsch Informational [Page 2]
|
---|
116 | |
---|
117 |
|
---|
118 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
119 |
|
---|
120 |
|
---|
121 | * Is compatible with the file format produced by the current
|
---|
122 | widely used gzip utility, in that conforming decompressors
|
---|
123 | will be able to read data produced by the existing gzip
|
---|
124 | compressor.
|
---|
125 |
|
---|
126 | The data format defined by this specification does not attempt to:
|
---|
127 |
|
---|
128 | * Allow random access to compressed data;
|
---|
129 | * Compress specialized data (e.g., raster graphics) as well
|
---|
130 | as the best currently available specialized algorithms.
|
---|
131 |
|
---|
132 | A simple counting argument shows that no lossless compression
|
---|
133 | algorithm can compress every possible input data set. For the
|
---|
134 | format defined here, the worst case expansion is 5 bytes per 32K-
|
---|
135 | byte block, i.e., a size increase of 0.015% for large data sets.
|
---|
136 | English text usually compresses by a factor of 2.5 to 3;
|
---|
137 | executable files usually compress somewhat less; graphical data
|
---|
138 | such as raster images may compress much more.
|
---|
139 |
|
---|
140 | 1.2. Intended audience
|
---|
141 |
|
---|
142 | This specification is intended for use by implementors of software
|
---|
143 | to compress data into "deflate" format and/or decompress data from
|
---|
144 | "deflate" format.
|
---|
145 |
|
---|
146 | The text of the specification assumes a basic background in
|
---|
147 | programming at the level of bits and other primitive data
|
---|
148 | representations. Familiarity with the technique of Huffman coding
|
---|
149 | is helpful but not required.
|
---|
150 |
|
---|
151 | 1.3. Scope
|
---|
152 |
|
---|
153 | The specification specifies a method for representing a sequence
|
---|
154 | of bytes as a (usually shorter) sequence of bits, and a method for
|
---|
155 | packing the latter bit sequence into bytes.
|
---|
156 |
|
---|
157 | 1.4. Compliance
|
---|
158 |
|
---|
159 | Unless otherwise indicated below, a compliant decompressor must be
|
---|
160 | able to accept and decompress any data set that conforms to all
|
---|
161 | the specifications presented here; a compliant compressor must
|
---|
162 | produce data sets that conform to all the specifications presented
|
---|
163 | here.
|
---|
164 |
|
---|
165 | 1.5. Definitions of terms and conventions used
|
---|
166 |
|
---|
167 | Byte: 8 bits stored or transmitted as a unit (same as an octet).
|
---|
168 | For this specification, a byte is exactly 8 bits, even on machines
|
---|
169 |
|
---|
170 |
|
---|
171 |
|
---|
172 | Deutsch Informational [Page 3]
|
---|
173 | |
---|
174 |
|
---|
175 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
176 |
|
---|
177 |
|
---|
178 | which store a character on a number of bits different from eight.
|
---|
179 | See below, for the numbering of bits within a byte.
|
---|
180 |
|
---|
181 | String: a sequence of arbitrary bytes.
|
---|
182 |
|
---|
183 | 1.6. Changes from previous versions
|
---|
184 |
|
---|
185 | There have been no technical changes to the deflate format since
|
---|
186 | version 1.1 of this specification. In version 1.2, some
|
---|
187 | terminology was changed. Version 1.3 is a conversion of the
|
---|
188 | specification to RFC style.
|
---|
189 |
|
---|
190 | 2. Compressed representation overview
|
---|
191 |
|
---|
192 | A compressed data set consists of a series of blocks, corresponding
|
---|
193 | to successive blocks of input data. The block sizes are arbitrary,
|
---|
194 | except that non-compressible blocks are limited to 65,535 bytes.
|
---|
195 |
|
---|
196 | Each block is compressed using a combination of the LZ77 algorithm
|
---|
197 | and Huffman coding. The Huffman trees for each block are independent
|
---|
198 | of those for previous or subsequent blocks; the LZ77 algorithm may
|
---|
199 | use a reference to a duplicated string occurring in a previous block,
|
---|
200 | up to 32K input bytes before.
|
---|
201 |
|
---|
202 | Each block consists of two parts: a pair of Huffman code trees that
|
---|
203 | describe the representation of the compressed data part, and a
|
---|
204 | compressed data part. (The Huffman trees themselves are compressed
|
---|
205 | using Huffman encoding.) The compressed data consists of a series of
|
---|
206 | elements of two types: literal bytes (of strings that have not been
|
---|
207 | detected as duplicated within the previous 32K input bytes), and
|
---|
208 | pointers to duplicated strings, where a pointer is represented as a
|
---|
209 | pair <length, backward distance>. The representation used in the
|
---|
210 | "deflate" format limits distances to 32K bytes and lengths to 258
|
---|
211 | bytes, but does not limit the size of a block, except for
|
---|
212 | uncompressible blocks, which are limited as noted above.
|
---|
213 |
|
---|
214 | Each type of value (literals, distances, and lengths) in the
|
---|
215 | compressed data is represented using a Huffman code, using one code
|
---|
216 | tree for literals and lengths and a separate code tree for distances.
|
---|
217 | The code trees for each block appear in a compact form just before
|
---|
218 | the compressed data for that block.
|
---|
219 |
|
---|
220 |
|
---|
221 |
|
---|
222 |
|
---|
223 |
|
---|
224 |
|
---|
225 |
|
---|
226 |
|
---|
227 |
|
---|
228 |
|
---|
229 | Deutsch Informational [Page 4]
|
---|
230 | |
---|
231 |
|
---|
232 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
233 |
|
---|
234 |
|
---|
235 | 3. Detailed specification
|
---|
236 |
|
---|
237 | 3.1. Overall conventions In the diagrams below, a box like this:
|
---|
238 |
|
---|
239 | +---+
|
---|
240 | | | <-- the vertical bars might be missing
|
---|
241 | +---+
|
---|
242 |
|
---|
243 | represents one byte; a box like this:
|
---|
244 |
|
---|
245 | +==============+
|
---|
246 | | |
|
---|
247 | +==============+
|
---|
248 |
|
---|
249 | represents a variable number of bytes.
|
---|
250 |
|
---|
251 | Bytes stored within a computer do not have a "bit order", since
|
---|
252 | they are always treated as a unit. However, a byte considered as
|
---|
253 | an integer between 0 and 255 does have a most- and least-
|
---|
254 | significant bit, and since we write numbers with the most-
|
---|
255 | significant digit on the left, we also write bytes with the most-
|
---|
256 | significant bit on the left. In the diagrams below, we number the
|
---|
257 | bits of a byte so that bit 0 is the least-significant bit, i.e.,
|
---|
258 | the bits are numbered:
|
---|
259 |
|
---|
260 | +--------+
|
---|
261 | |76543210|
|
---|
262 | +--------+
|
---|
263 |
|
---|
264 | Within a computer, a number may occupy multiple bytes. All
|
---|
265 | multi-byte numbers in the format described here are stored with
|
---|
266 | the least-significant byte first (at the lower memory address).
|
---|
267 | For example, the decimal number 520 is stored as:
|
---|
268 |
|
---|
269 | 0 1
|
---|
270 | +--------+--------+
|
---|
271 | |00001000|00000010|
|
---|
272 | +--------+--------+
|
---|
273 | ^ ^
|
---|
274 | | |
|
---|
275 | | + more significant byte = 2 x 256
|
---|
276 | + less significant byte = 8
|
---|
277 |
|
---|
278 | 3.1.1. Packing into bytes
|
---|
279 |
|
---|
280 | This document does not address the issue of the order in which
|
---|
281 | bits of a byte are transmitted on a bit-sequential medium,
|
---|
282 | since the final data format described here is byte- rather than
|
---|
283 |
|
---|
284 |
|
---|
285 |
|
---|
286 | Deutsch Informational [Page 5]
|
---|
287 | |
---|
288 |
|
---|
289 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
290 |
|
---|
291 |
|
---|
292 | bit-oriented. However, we describe the compressed block format
|
---|
293 | in below, as a sequence of data elements of various bit
|
---|
294 | lengths, not a sequence of bytes. We must therefore specify
|
---|
295 | how to pack these data elements into bytes to form the final
|
---|
296 | compressed byte sequence:
|
---|
297 |
|
---|
298 | * Data elements are packed into bytes in order of
|
---|
299 | increasing bit number within the byte, i.e., starting
|
---|
300 | with the least-significant bit of the byte.
|
---|
301 | * Data elements other than Huffman codes are packed
|
---|
302 | starting with the least-significant bit of the data
|
---|
303 | element.
|
---|
304 | * Huffman codes are packed starting with the most-
|
---|
305 | significant bit of the code.
|
---|
306 |
|
---|
307 | In other words, if one were to print out the compressed data as
|
---|
308 | a sequence of bytes, starting with the first byte at the
|
---|
309 | *right* margin and proceeding to the *left*, with the most-
|
---|
310 | significant bit of each byte on the left as usual, one would be
|
---|
311 | able to parse the result from right to left, with fixed-width
|
---|
312 | elements in the correct MSB-to-LSB order and Huffman codes in
|
---|
313 | bit-reversed order (i.e., with the first bit of the code in the
|
---|
314 | relative LSB position).
|
---|
315 |
|
---|
316 | 3.2. Compressed block format
|
---|
317 |
|
---|
318 | 3.2.1. Synopsis of prefix and Huffman coding
|
---|
319 |
|
---|
320 | Prefix coding represents symbols from an a priori known
|
---|
321 | alphabet by bit sequences (codes), one code for each symbol, in
|
---|
322 | a manner such that different symbols may be represented by bit
|
---|
323 | sequences of different lengths, but a parser can always parse
|
---|
324 | an encoded string unambiguously symbol-by-symbol.
|
---|
325 |
|
---|
326 | We define a prefix code in terms of a binary tree in which the
|
---|
327 | two edges descending from each non-leaf node are labeled 0 and
|
---|
328 | 1 and in which the leaf nodes correspond one-for-one with (are
|
---|
329 | labeled with) the symbols of the alphabet; then the code for a
|
---|
330 | symbol is the sequence of 0's and 1's on the edges leading from
|
---|
331 | the root to the leaf labeled with that symbol. For example:
|
---|
332 |
|
---|
333 |
|
---|
334 |
|
---|
335 |
|
---|
336 |
|
---|
337 |
|
---|
338 |
|
---|
339 |
|
---|
340 |
|
---|
341 |
|
---|
342 |
|
---|
343 | Deutsch Informational [Page 6]
|
---|
344 | |
---|
345 |
|
---|
346 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
347 |
|
---|
348 |
|
---|
349 | /\ Symbol Code
|
---|
350 | 0 1 ------ ----
|
---|
351 | / \ A 00
|
---|
352 | /\ B B 1
|
---|
353 | 0 1 C 011
|
---|
354 | / \ D 010
|
---|
355 | A /\
|
---|
356 | 0 1
|
---|
357 | / \
|
---|
358 | D C
|
---|
359 |
|
---|
360 | A parser can decode the next symbol from an encoded input
|
---|
361 | stream by walking down the tree from the root, at each step
|
---|
362 | choosing the edge corresponding to the next input bit.
|
---|
363 |
|
---|
364 | Given an alphabet with known symbol frequencies, the Huffman
|
---|
365 | algorithm allows the construction of an optimal prefix code
|
---|
366 | (one which represents strings with those symbol frequencies
|
---|
367 | using the fewest bits of any possible prefix codes for that
|
---|
368 | alphabet). Such a code is called a Huffman code. (See
|
---|
369 | reference [1] in Chapter 5, references for additional
|
---|
370 | information on Huffman codes.)
|
---|
371 |
|
---|
372 | Note that in the "deflate" format, the Huffman codes for the
|
---|
373 | various alphabets must not exceed certain maximum code lengths.
|
---|
374 | This constraint complicates the algorithm for computing code
|
---|
375 | lengths from symbol frequencies. Again, see Chapter 5,
|
---|
376 | references for details.
|
---|
377 |
|
---|
378 | 3.2.2. Use of Huffman coding in the "deflate" format
|
---|
379 |
|
---|
380 | The Huffman codes used for each alphabet in the "deflate"
|
---|
381 | format have two additional rules:
|
---|
382 |
|
---|
383 | * All codes of a given bit length have lexicographically
|
---|
384 | consecutive values, in the same order as the symbols
|
---|
385 | they represent;
|
---|
386 |
|
---|
387 | * Shorter codes lexicographically precede longer codes.
|
---|
388 |
|
---|
389 |
|
---|
390 |
|
---|
391 |
|
---|
392 |
|
---|
393 |
|
---|
394 |
|
---|
395 |
|
---|
396 |
|
---|
397 |
|
---|
398 |
|
---|
399 |
|
---|
400 | Deutsch Informational [Page 7]
|
---|
401 | |
---|
402 |
|
---|
403 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
404 |
|
---|
405 |
|
---|
406 | We could recode the example above to follow this rule as
|
---|
407 | follows, assuming that the order of the alphabet is ABCD:
|
---|
408 |
|
---|
409 | Symbol Code
|
---|
410 | ------ ----
|
---|
411 | A 10
|
---|
412 | B 0
|
---|
413 | C 110
|
---|
414 | D 111
|
---|
415 |
|
---|
416 | I.e., 0 precedes 10 which precedes 11x, and 110 and 111 are
|
---|
417 | lexicographically consecutive.
|
---|
418 |
|
---|
419 | Given this rule, we can define the Huffman code for an alphabet
|
---|
420 | just by giving the bit lengths of the codes for each symbol of
|
---|
421 | the alphabet in order; this is sufficient to determine the
|
---|
422 | actual codes. In our example, the code is completely defined
|
---|
423 | by the sequence of bit lengths (2, 1, 3, 3). The following
|
---|
424 | algorithm generates the codes as integers, intended to be read
|
---|
425 | from most- to least-significant bit. The code lengths are
|
---|
426 | initially in tree[I].Len; the codes are produced in
|
---|
427 | tree[I].Code.
|
---|
428 |
|
---|
429 | 1) Count the number of codes for each code length. Let
|
---|
430 | bl_count[N] be the number of codes of length N, N >= 1.
|
---|
431 |
|
---|
432 | 2) Find the numerical value of the smallest code for each
|
---|
433 | code length:
|
---|
434 |
|
---|
435 | code = 0;
|
---|
436 | bl_count[0] = 0;
|
---|
437 | for (bits = 1; bits <= MAX_BITS; bits++) {
|
---|
438 | code = (code + bl_count[bits-1]) << 1;
|
---|
439 | next_code[bits] = code;
|
---|
440 | }
|
---|
441 |
|
---|
442 | 3) Assign numerical values to all codes, using consecutive
|
---|
443 | values for all codes of the same length with the base
|
---|
444 | values determined at step 2. Codes that are never used
|
---|
445 | (which have a bit length of zero) must not be assigned a
|
---|
446 | value.
|
---|
447 |
|
---|
448 | for (n = 0; n <= max_code; n++) {
|
---|
449 | len = tree[n].Len;
|
---|
450 | if (len != 0) {
|
---|
451 | tree[n].Code = next_code[len];
|
---|
452 | next_code[len]++;
|
---|
453 | }
|
---|
454 |
|
---|
455 |
|
---|
456 |
|
---|
457 | Deutsch Informational [Page 8]
|
---|
458 | |
---|
459 |
|
---|
460 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
461 |
|
---|
462 |
|
---|
463 | }
|
---|
464 |
|
---|
465 | Example:
|
---|
466 |
|
---|
467 | Consider the alphabet ABCDEFGH, with bit lengths (3, 3, 3, 3,
|
---|
468 | 3, 2, 4, 4). After step 1, we have:
|
---|
469 |
|
---|
470 | N bl_count[N]
|
---|
471 | - -----------
|
---|
472 | 2 1
|
---|
473 | 3 5
|
---|
474 | 4 2
|
---|
475 |
|
---|
476 | Step 2 computes the following next_code values:
|
---|
477 |
|
---|
478 | N next_code[N]
|
---|
479 | - ------------
|
---|
480 | 1 0
|
---|
481 | 2 0
|
---|
482 | 3 2
|
---|
483 | 4 14
|
---|
484 |
|
---|
485 | Step 3 produces the following code values:
|
---|
486 |
|
---|
487 | Symbol Length Code
|
---|
488 | ------ ------ ----
|
---|
489 | A 3 010
|
---|
490 | B 3 011
|
---|
491 | C 3 100
|
---|
492 | D 3 101
|
---|
493 | E 3 110
|
---|
494 | F 2 00
|
---|
495 | G 4 1110
|
---|
496 | H 4 1111
|
---|
497 |
|
---|
498 | 3.2.3. Details of block format
|
---|
499 |
|
---|
500 | Each block of compressed data begins with 3 header bits
|
---|
501 | containing the following data:
|
---|
502 |
|
---|
503 | first bit BFINAL
|
---|
504 | next 2 bits BTYPE
|
---|
505 |
|
---|
506 | Note that the header bits do not necessarily begin on a byte
|
---|
507 | boundary, since a block does not necessarily occupy an integral
|
---|
508 | number of bytes.
|
---|
509 |
|
---|
510 |
|
---|
511 |
|
---|
512 |
|
---|
513 |
|
---|
514 | Deutsch Informational [Page 9]
|
---|
515 | |
---|
516 |
|
---|
517 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
518 |
|
---|
519 |
|
---|
520 | BFINAL is set if and only if this is the last block of the data
|
---|
521 | set.
|
---|
522 |
|
---|
523 | BTYPE specifies how the data are compressed, as follows:
|
---|
524 |
|
---|
525 | 00 - no compression
|
---|
526 | 01 - compressed with fixed Huffman codes
|
---|
527 | 10 - compressed with dynamic Huffman codes
|
---|
528 | 11 - reserved (error)
|
---|
529 |
|
---|
530 | The only difference between the two compressed cases is how the
|
---|
531 | Huffman codes for the literal/length and distance alphabets are
|
---|
532 | defined.
|
---|
533 |
|
---|
534 | In all cases, the decoding algorithm for the actual data is as
|
---|
535 | follows:
|
---|
536 |
|
---|
537 | do
|
---|
538 | read block header from input stream.
|
---|
539 | if stored with no compression
|
---|
540 | skip any remaining bits in current partially
|
---|
541 | processed byte
|
---|
542 | read LEN and NLEN (see next section)
|
---|
543 | copy LEN bytes of data to output
|
---|
544 | otherwise
|
---|
545 | if compressed with dynamic Huffman codes
|
---|
546 | read representation of code trees (see
|
---|
547 | subsection below)
|
---|
548 | loop (until end of block code recognized)
|
---|
549 | decode literal/length value from input stream
|
---|
550 | if value < 256
|
---|
551 | copy value (literal byte) to output stream
|
---|
552 | otherwise
|
---|
553 | if value = end of block (256)
|
---|
554 | break from loop
|
---|
555 | otherwise (value = 257..285)
|
---|
556 | decode distance from input stream
|
---|
557 |
|
---|
558 | move backwards distance bytes in the output
|
---|
559 | stream, and copy length bytes from this
|
---|
560 | position to the output stream.
|
---|
561 | end loop
|
---|
562 | while not last block
|
---|
563 |
|
---|
564 | Note that a duplicated string reference may refer to a string
|
---|
565 | in a previous block; i.e., the backward distance may cross one
|
---|
566 | or more block boundaries. However a distance cannot refer past
|
---|
567 | the beginning of the output stream. (An application using a
|
---|
568 |
|
---|
569 |
|
---|
570 |
|
---|
571 | Deutsch Informational [Page 10]
|
---|
572 | |
---|
573 |
|
---|
574 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
575 |
|
---|
576 |
|
---|
577 | preset dictionary might discard part of the output stream; a
|
---|
578 | distance can refer to that part of the output stream anyway)
|
---|
579 | Note also that the referenced string may overlap the current
|
---|
580 | position; for example, if the last 2 bytes decoded have values
|
---|
581 | X and Y, a string reference with <length = 5, distance = 2>
|
---|
582 | adds X,Y,X,Y,X to the output stream.
|
---|
583 |
|
---|
584 | We now specify each compression method in turn.
|
---|
585 |
|
---|
586 | 3.2.4. Non-compressed blocks (BTYPE=00)
|
---|
587 |
|
---|
588 | Any bits of input up to the next byte boundary are ignored.
|
---|
589 | The rest of the block consists of the following information:
|
---|
590 |
|
---|
591 | 0 1 2 3 4...
|
---|
592 | +---+---+---+---+================================+
|
---|
593 | | LEN | NLEN |... LEN bytes of literal data...|
|
---|
594 | +---+---+---+---+================================+
|
---|
595 |
|
---|
596 | LEN is the number of data bytes in the block. NLEN is the
|
---|
597 | one's complement of LEN.
|
---|
598 |
|
---|
599 | 3.2.5. Compressed blocks (length and distance codes)
|
---|
600 |
|
---|
601 | As noted above, encoded data blocks in the "deflate" format
|
---|
602 | consist of sequences of symbols drawn from three conceptually
|
---|
603 | distinct alphabets: either literal bytes, from the alphabet of
|
---|
604 | byte values (0..255), or <length, backward distance> pairs,
|
---|
605 | where the length is drawn from (3..258) and the distance is
|
---|
606 | drawn from (1..32,768). In fact, the literal and length
|
---|
607 | alphabets are merged into a single alphabet (0..285), where
|
---|
608 | values 0..255 represent literal bytes, the value 256 indicates
|
---|
609 | end-of-block, and values 257..285 represent length codes
|
---|
610 | (possibly in conjunction with extra bits following the symbol
|
---|
611 | code) as follows:
|
---|
612 |
|
---|
613 |
|
---|
614 |
|
---|
615 |
|
---|
616 |
|
---|
617 |
|
---|
618 |
|
---|
619 |
|
---|
620 |
|
---|
621 |
|
---|
622 |
|
---|
623 |
|
---|
624 |
|
---|
625 |
|
---|
626 |
|
---|
627 |
|
---|
628 | Deutsch Informational [Page 11]
|
---|
629 | |
---|
630 |
|
---|
631 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
632 |
|
---|
633 |
|
---|
634 | Extra Extra Extra
|
---|
635 | Code Bits Length(s) Code Bits Lengths Code Bits Length(s)
|
---|
636 | ---- ---- ------ ---- ---- ------- ---- ---- -------
|
---|
637 | 257 0 3 267 1 15,16 277 4 67-82
|
---|
638 | 258 0 4 268 1 17,18 278 4 83-98
|
---|
639 | 259 0 5 269 2 19-22 279 4 99-114
|
---|
640 | 260 0 6 270 2 23-26 280 4 115-130
|
---|
641 | 261 0 7 271 2 27-30 281 5 131-162
|
---|
642 | 262 0 8 272 2 31-34 282 5 163-194
|
---|
643 | 263 0 9 273 3 35-42 283 5 195-226
|
---|
644 | 264 0 10 274 3 43-50 284 5 227-257
|
---|
645 | 265 1 11,12 275 3 51-58 285 0 258
|
---|
646 | 266 1 13,14 276 3 59-66
|
---|
647 |
|
---|
648 | The extra bits should be interpreted as a machine integer
|
---|
649 | stored with the most-significant bit first, e.g., bits 1110
|
---|
650 | represent the value 14.
|
---|
651 |
|
---|
652 | Extra Extra Extra
|
---|
653 | Code Bits Dist Code Bits Dist Code Bits Distance
|
---|
654 | ---- ---- ---- ---- ---- ------ ---- ---- --------
|
---|
655 | 0 0 1 10 4 33-48 20 9 1025-1536
|
---|
656 | 1 0 2 11 4 49-64 21 9 1537-2048
|
---|
657 | 2 0 3 12 5 65-96 22 10 2049-3072
|
---|
658 | 3 0 4 13 5 97-128 23 10 3073-4096
|
---|
659 | 4 1 5,6 14 6 129-192 24 11 4097-6144
|
---|
660 | 5 1 7,8 15 6 193-256 25 11 6145-8192
|
---|
661 | 6 2 9-12 16 7 257-384 26 12 8193-12288
|
---|
662 | 7 2 13-16 17 7 385-512 27 12 12289-16384
|
---|
663 | 8 3 17-24 18 8 513-768 28 13 16385-24576
|
---|
664 | 9 3 25-32 19 8 769-1024 29 13 24577-32768
|
---|
665 |
|
---|
666 | 3.2.6. Compression with fixed Huffman codes (BTYPE=01)
|
---|
667 |
|
---|
668 | The Huffman codes for the two alphabets are fixed, and are not
|
---|
669 | represented explicitly in the data. The Huffman code lengths
|
---|
670 | for the literal/length alphabet are:
|
---|
671 |
|
---|
672 | Lit Value Bits Codes
|
---|
673 | --------- ---- -----
|
---|
674 | 0 - 143 8 00110000 through
|
---|
675 | 10111111
|
---|
676 | 144 - 255 9 110010000 through
|
---|
677 | 111111111
|
---|
678 | 256 - 279 7 0000000 through
|
---|
679 | 0010111
|
---|
680 | 280 - 287 8 11000000 through
|
---|
681 | 11000111
|
---|
682 |
|
---|
683 |
|
---|
684 |
|
---|
685 | Deutsch Informational [Page 12]
|
---|
686 | |
---|
687 |
|
---|
688 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
689 |
|
---|
690 |
|
---|
691 | The code lengths are sufficient to generate the actual codes,
|
---|
692 | as described above; we show the codes in the table for added
|
---|
693 | clarity. Literal/length values 286-287 will never actually
|
---|
694 | occur in the compressed data, but participate in the code
|
---|
695 | construction.
|
---|
696 |
|
---|
697 | Distance codes 0-31 are represented by (fixed-length) 5-bit
|
---|
698 | codes, with possible additional bits as shown in the table
|
---|
699 | shown in Paragraph 3.2.5, above. Note that distance codes 30-
|
---|
700 | 31 will never actually occur in the compressed data.
|
---|
701 |
|
---|
702 | 3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
|
---|
703 |
|
---|
704 | The Huffman codes for the two alphabets appear in the block
|
---|
705 | immediately after the header bits and before the actual
|
---|
706 | compressed data, first the literal/length code and then the
|
---|
707 | distance code. Each code is defined by a sequence of code
|
---|
708 | lengths, as discussed in Paragraph 3.2.2, above. For even
|
---|
709 | greater compactness, the code length sequences themselves are
|
---|
710 | compressed using a Huffman code. The alphabet for code lengths
|
---|
711 | is as follows:
|
---|
712 |
|
---|
713 | 0 - 15: Represent code lengths of 0 - 15
|
---|
714 | 16: Copy the previous code length 3 - 6 times.
|
---|
715 | The next 2 bits indicate repeat length
|
---|
716 | (0 = 3, ... , 3 = 6)
|
---|
717 | Example: Codes 8, 16 (+2 bits 11),
|
---|
718 | 16 (+2 bits 10) will expand to
|
---|
719 | 12 code lengths of 8 (1 + 6 + 5)
|
---|
720 | 17: Repeat a code length of 0 for 3 - 10 times.
|
---|
721 | (3 bits of length)
|
---|
722 | 18: Repeat a code length of 0 for 11 - 138 times
|
---|
723 | (7 bits of length)
|
---|
724 |
|
---|
725 | A code length of 0 indicates that the corresponding symbol in
|
---|
726 | the literal/length or distance alphabet will not occur in the
|
---|
727 | block, and should not participate in the Huffman code
|
---|
728 | construction algorithm given earlier. If only one distance
|
---|
729 | code is used, it is encoded using one bit, not zero bits; in
|
---|
730 | this case there is a single code length of one, with one unused
|
---|
731 | code. One distance code of zero bits means that there are no
|
---|
732 | distance codes used at all (the data is all literals).
|
---|
733 |
|
---|
734 | We can now define the format of the block:
|
---|
735 |
|
---|
736 | 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
|
---|
737 | 5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
|
---|
738 | 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
|
---|
739 |
|
---|
740 |
|
---|
741 |
|
---|
742 | Deutsch Informational [Page 13]
|
---|
743 | |
---|
744 |
|
---|
745 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
746 |
|
---|
747 |
|
---|
748 | (HCLEN + 4) x 3 bits: code lengths for the code length
|
---|
749 | alphabet given just above, in the order: 16, 17, 18,
|
---|
750 | 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
|
---|
751 |
|
---|
752 | These code lengths are interpreted as 3-bit integers
|
---|
753 | (0-7); as above, a code length of 0 means the
|
---|
754 | corresponding symbol (literal/length or distance code
|
---|
755 | length) is not used.
|
---|
756 |
|
---|
757 | HLIT + 257 code lengths for the literal/length alphabet,
|
---|
758 | encoded using the code length Huffman code
|
---|
759 |
|
---|
760 | HDIST + 1 code lengths for the distance alphabet,
|
---|
761 | encoded using the code length Huffman code
|
---|
762 |
|
---|
763 | The actual compressed data of the block,
|
---|
764 | encoded using the literal/length and distance Huffman
|
---|
765 | codes
|
---|
766 |
|
---|
767 | The literal/length symbol 256 (end of data),
|
---|
768 | encoded using the literal/length Huffman code
|
---|
769 |
|
---|
770 | The code length repeat codes can cross from HLIT + 257 to the
|
---|
771 | HDIST + 1 code lengths. In other words, all code lengths form
|
---|
772 | a single sequence of HLIT + HDIST + 258 values.
|
---|
773 |
|
---|
774 | 3.3. Compliance
|
---|
775 |
|
---|
776 | A compressor may limit further the ranges of values specified in
|
---|
777 | the previous section and still be compliant; for example, it may
|
---|
778 | limit the range of backward pointers to some value smaller than
|
---|
779 | 32K. Similarly, a compressor may limit the size of blocks so that
|
---|
780 | a compressible block fits in memory.
|
---|
781 |
|
---|
782 | A compliant decompressor must accept the full range of possible
|
---|
783 | values defined in the previous section, and must accept blocks of
|
---|
784 | arbitrary size.
|
---|
785 |
|
---|
786 | 4. Compression algorithm details
|
---|
787 |
|
---|
788 | While it is the intent of this document to define the "deflate"
|
---|
789 | compressed data format without reference to any particular
|
---|
790 | compression algorithm, the format is related to the compressed
|
---|
791 | formats produced by LZ77 (Lempel-Ziv 1977, see reference [2] below);
|
---|
792 | since many variations of LZ77 are patented, it is strongly
|
---|
793 | recommended that the implementor of a compressor follow the general
|
---|
794 | algorithm presented here, which is known not to be patented per se.
|
---|
795 | The material in this section is not part of the definition of the
|
---|
796 |
|
---|
797 |
|
---|
798 |
|
---|
799 | Deutsch Informational [Page 14]
|
---|
800 | |
---|
801 |
|
---|
802 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
803 |
|
---|
804 |
|
---|
805 | specification per se, and a compressor need not follow it in order to
|
---|
806 | be compliant.
|
---|
807 |
|
---|
808 | The compressor terminates a block when it determines that starting a
|
---|
809 | new block with fresh trees would be useful, or when the block size
|
---|
810 | fills up the compressor's block buffer.
|
---|
811 |
|
---|
812 | The compressor uses a chained hash table to find duplicated strings,
|
---|
813 | using a hash function that operates on 3-byte sequences. At any
|
---|
814 | given point during compression, let XYZ be the next 3 input bytes to
|
---|
815 | be examined (not necessarily all different, of course). First, the
|
---|
816 | compressor examines the hash chain for XYZ. If the chain is empty,
|
---|
817 | the compressor simply writes out X as a literal byte and advances one
|
---|
818 | byte in the input. If the hash chain is not empty, indicating that
|
---|
819 | the sequence XYZ (or, if we are unlucky, some other 3 bytes with the
|
---|
820 | same hash function value) has occurred recently, the compressor
|
---|
821 | compares all strings on the XYZ hash chain with the actual input data
|
---|
822 | sequence starting at the current point, and selects the longest
|
---|
823 | match.
|
---|
824 |
|
---|
825 | The compressor searches the hash chains starting with the most recent
|
---|
826 | strings, to favor small distances and thus take advantage of the
|
---|
827 | Huffman encoding. The hash chains are singly linked. There are no
|
---|
828 | deletions from the hash chains; the algorithm simply discards matches
|
---|
829 | that are too old. To avoid a worst-case situation, very long hash
|
---|
830 | chains are arbitrarily truncated at a certain length, determined by a
|
---|
831 | run-time parameter.
|
---|
832 |
|
---|
833 | To improve overall compression, the compressor optionally defers the
|
---|
834 | selection of matches ("lazy matching"): after a match of length N has
|
---|
835 | been found, the compressor searches for a longer match starting at
|
---|
836 | the next input byte. If it finds a longer match, it truncates the
|
---|
837 | previous match to a length of one (thus producing a single literal
|
---|
838 | byte) and then emits the longer match. Otherwise, it emits the
|
---|
839 | original match, and, as described above, advances N bytes before
|
---|
840 | continuing.
|
---|
841 |
|
---|
842 | Run-time parameters also control this "lazy match" procedure. If
|
---|
843 | compression ratio is most important, the compressor attempts a
|
---|
844 | complete second search regardless of the length of the first match.
|
---|
845 | In the normal case, if the current match is "long enough", the
|
---|
846 | compressor reduces the search for a longer match, thus speeding up
|
---|
847 | the process. If speed is most important, the compressor inserts new
|
---|
848 | strings in the hash table only when no match was found, or when the
|
---|
849 | match is not "too long". This degrades the compression ratio but
|
---|
850 | saves time since there are both fewer insertions and fewer searches.
|
---|
851 |
|
---|
852 |
|
---|
853 |
|
---|
854 |
|
---|
855 |
|
---|
856 | Deutsch Informational [Page 15]
|
---|
857 | |
---|
858 |
|
---|
859 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
860 |
|
---|
861 |
|
---|
862 | 5. References
|
---|
863 |
|
---|
864 | [1] Huffman, D. A., "A Method for the Construction of Minimum
|
---|
865 | Redundancy Codes", Proceedings of the Institute of Radio
|
---|
866 | Engineers, September 1952, Volume 40, Number 9, pp. 1098-1101.
|
---|
867 |
|
---|
868 | [2] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
|
---|
869 | Compression", IEEE Transactions on Information Theory, Vol. 23,
|
---|
870 | No. 3, pp. 337-343.
|
---|
871 |
|
---|
872 | [3] Gailly, J.-L., and Adler, M., ZLIB documentation and sources,
|
---|
873 | available in ftp://ftp.uu.net/pub/archiving/zip/doc/
|
---|
874 |
|
---|
875 | [4] Gailly, J.-L., and Adler, M., GZIP documentation and sources,
|
---|
876 | available as gzip-*.tar in ftp://prep.ai.mit.edu/pub/gnu/
|
---|
877 |
|
---|
878 | [5] Schwartz, E. S., and Kallick, B. "Generating a canonical prefix
|
---|
879 | encoding." Comm. ACM, 7,3 (Mar. 1964), pp. 166-169.
|
---|
880 |
|
---|
881 | [6] Hirschberg and Lelewer, "Efficient decoding of prefix codes,"
|
---|
882 | Comm. ACM, 33,4, April 1990, pp. 449-459.
|
---|
883 |
|
---|
884 | 6. Security Considerations
|
---|
885 |
|
---|
886 | Any data compression method involves the reduction of redundancy in
|
---|
887 | the data. Consequently, any corruption of the data is likely to have
|
---|
888 | severe effects and be difficult to correct. Uncompressed text, on
|
---|
889 | the other hand, will probably still be readable despite the presence
|
---|
890 | of some corrupted bytes.
|
---|
891 |
|
---|
892 | It is recommended that systems using this data format provide some
|
---|
893 | means of validating the integrity of the compressed data. See
|
---|
894 | reference [3], for example.
|
---|
895 |
|
---|
896 | 7. Source code
|
---|
897 |
|
---|
898 | Source code for a C language implementation of a "deflate" compliant
|
---|
899 | compressor and decompressor is available within the zlib package at
|
---|
900 | ftp://ftp.uu.net/pub/archiving/zip/zlib/.
|
---|
901 |
|
---|
902 | 8. Acknowledgements
|
---|
903 |
|
---|
904 | Trademarks cited in this document are the property of their
|
---|
905 | respective owners.
|
---|
906 |
|
---|
907 | Phil Katz designed the deflate format. Jean-Loup Gailly and Mark
|
---|
908 | Adler wrote the related software described in this specification.
|
---|
909 | Glenn Randers-Pehrson converted this document to RFC and HTML format.
|
---|
910 |
|
---|
911 |
|
---|
912 |
|
---|
913 | Deutsch Informational [Page 16]
|
---|
914 | |
---|
915 |
|
---|
916 | RFC 1951 DEFLATE Compressed Data Format Specification May 1996
|
---|
917 |
|
---|
918 |
|
---|
919 | 9. Author's Address
|
---|
920 |
|
---|
921 | L. Peter Deutsch
|
---|
922 | Aladdin Enterprises
|
---|
923 | 203 Santa Margarita Ave.
|
---|
924 | Menlo Park, CA 94025
|
---|
925 |
|
---|
926 | Phone: (415) 322-0103 (AM only)
|
---|
927 | FAX: (415) 322-1734
|
---|
928 | EMail: <ghost@aladdin.com>
|
---|
929 |
|
---|
930 | Questions about the technical content of this specification can be
|
---|
931 | sent by email to:
|
---|
932 |
|
---|
933 | Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
|
---|
934 | Mark Adler <madler@alumni.caltech.edu>
|
---|
935 |
|
---|
936 | Editorial comments on this specification can be sent by email to:
|
---|
937 |
|
---|
938 | L. Peter Deutsch <ghost@aladdin.com> and
|
---|
939 | Glenn Randers-Pehrson <randeg@alumni.rpi.edu>
|
---|
940 |
|
---|
941 |
|
---|
942 |
|
---|
943 |
|
---|
944 |
|
---|
945 |
|
---|
946 |
|
---|
947 |
|
---|
948 |
|
---|
949 |
|
---|
950 |
|
---|
951 |
|
---|
952 |
|
---|
953 |
|
---|
954 |
|
---|
955 |
|
---|
956 |
|
---|
957 |
|
---|
958 |
|
---|
959 |
|
---|
960 |
|
---|
961 |
|
---|
962 |
|
---|
963 |
|
---|
964 |
|
---|
965 |
|
---|
966 |
|
---|
967 |
|
---|
968 |
|
---|
969 |
|
---|
970 | Deutsch Informational [Page 17]
|
---|
971 | |
---|
972 |
|
---|