Network Working Group                                         P. Deutsch
Request for Comments: 1952                           Aladdin Enterprises
Category: Informational                                         May 1996


               GZIP file format specification version 4.3

Status of This Memo

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

IESG Note:

   The IESG takes no position on the validity of any Intellectual
   Property Rights statements contained in this document.

Notices

   Copyright (c) 1996 L. Peter Deutsch

   Permission is granted to copy and distribute this document for any
   purpose and without charge, including translations into other
   languages and incorporation into compilations, provided that the
   copyright notice and this notice are preserved, and that any
   substantive changes or deletions from the original are clearly
   marked.

   A pointer to the latest version of this and related documentation in
   HTML format can be found at the URL
   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.

Abstract

   This specification defines a lossless compressed data format that is
   compatible with the widely used GZIP utility. The format includes a
   cyclic redundancy check value for detecting data corruption. The
   format presently uses the DEFLATE method of compression but can be
   easily extended to use other compression methods. The format can be
   implemented readily in a manner not covered by patents.

Deutsch                      Informational                      [Page 1]

RFC 1952             GZIP File Format Specification             May 1996

Table of Contents

   1. Introduction
      1.1. Purpose
      1.2. Intended audience
      1.3. Scope
      1.4. Compliance
      1.5. Definitions of terms and conventions used
      1.6. Changes from previous versions
   2. Detailed specification
      2.1. Overall conventions
      2.2. File format
      2.3. Member format
         2.3.1. Member header and trailer
            2.3.1.1. Extra field
            2.3.1.2. Compliance
   3. References
   4. Security Considerations
   5. Acknowledgements
   6. Author's Address
   7. Appendix: Jean-Loup Gailly's gzip utility
   8. Appendix: Sample CRC Code

1. Introduction

   1.1. Purpose

      The purpose of this specification is to define a lossless
      compressed data format that:

          * Is independent of CPU type, operating system, file system,
            and character set, and hence can be used for interchange;
          * Can compress or decompress a data stream (as opposed to a
            randomly accessible file) to produce another data stream,
            using only an a priori bounded amount of intermediate
            storage, and hence can be used in data communications or
            similar structures such as Unix filters;
          * Compresses data with efficiency comparable to the best
            currently available general-purpose compression methods,
            and in particular considerably better than the "compress"
            program;
          * Can be implemented readily in a manner not covered by
            patents, and hence can be practiced freely;
          * Is compatible with the file format produced by the current
            widely used gzip utility, in that conforming decompressors
            will be able to read data produced by the existing gzip
            compressor.

      The data format defined by this specification does not attempt to:

          * Provide random access to compressed data;
          * Compress specialized data (e.g., raster graphics) as well as
            the best currently available specialized algorithms.

   1.2. Intended audience

      This specification is intended for use by implementors of software
      to compress data into gzip format and/or decompress data from gzip
      format.

      The text of the specification assumes a basic background in
      programming at the level of bits and other primitive data
      representations.

   1.3. Scope

      The specification specifies a compression method and a file format
      (the latter assuming only that a file can store a sequence of
      arbitrary bytes). It does not specify any particular interface to
      a file system or anything about character sets or encodings
      (except for file names and comments, which are optional).

   1.4. Compliance

      Unless otherwise indicated below, a compliant decompressor must be
      able to accept and decompress any file that conforms to all the
      specifications presented here; a compliant compressor must produce
      files that conform to all the specifications presented here. The
      material in the appendices is not part of the specification per se
      and is not relevant to compliance.

   1.5. Definitions of terms and conventions used

      byte: 8 bits stored or transmitted as a unit (same as an octet).
      (For this specification, a byte is exactly 8 bits, even on
      machines which store a character on a number of bits different
      from 8.) See below for the numbering of bits within a byte.

   1.6. Changes from previous versions

      There have been no technical changes to the gzip format since
      version 4.1 of this specification. In version 4.2, some
      terminology was changed, and the sample CRC code was rewritten for
      clarity and to eliminate the requirement for the caller to do
      pre- and post-conditioning. Version 4.3 is a conversion of the
      specification to RFC style.

2. Detailed specification

   2.1. Overall conventions

      In the diagrams below, a box like this:

         +---+
         |   | <-- the vertical bars might be missing
         +---+

      represents one byte; a box like this:

         +==============+
         |              |
         +==============+

      represents a variable number of bytes.

      Bytes stored within a computer do not have a "bit order", since
      they are always treated as a unit. However, a byte considered as
      an integer between 0 and 255 does have a most-significant and a
      least-significant bit, and since we write numbers with the
      most-significant digit on the left, we also write bytes with the
      most-significant bit on the left. In the diagrams below, we number
      the bits of a byte so that bit 0 is the least-significant bit,
      i.e., the bits are numbered:

         +--------+
         |76543210|
         +--------+

      This document does not address the issue of the order in which
      bits of a byte are transmitted on a bit-sequential medium, since
      the data format described here is byte- rather than bit-oriented.

      Within a computer, a number may occupy multiple bytes. All
      multi-byte numbers in the format described here are stored with
      the least-significant byte first (at the lower memory address).
      For example, the decimal number 520 is stored as:

             0        1
         +--------+--------+
         |00001000|00000010|
         +--------+--------+
          ^        ^
          |        |
          |        + more significant byte = 2 x 256
          + less significant byte = 8
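
      As an illustration only (not part of the specification), the
      byte order described above might be handled in C roughly as
      follows. The helper names read_le16, read_le32, and write_le16
      are hypothetical and chosen for this sketch.

```c
#include <stdint.h>

/* Read a 16-bit value stored least-significant byte first at p[0..1]. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 32-bit value stored least-significant byte first at p[0..3]. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Store a 16-bit value at p[0..1], least-significant byte first. */
static void write_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}
```

      With these helpers, the two bytes 0x08, 0x02 from the diagram
      decode to 520, and writing 520 produces the same two bytes.
      Assembling the value from individual bytes keeps the code
      independent of the byte order of the host machine.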

   2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets). The format of each member is specified in the following
      section. The members simply appear one after another in the file,
      with no additional information before, between, or after them.

   2.3. Member format

      Each member has the following structure:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

      (if FLG.FEXTRA set)

         +---+---+=================================+
         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
         +---+---+=================================+

      (if FLG.FNAME set)

         +=========================================+
         |...original file name, zero-terminated...| (more-->)
         +=========================================+

      (if FLG.FCOMMENT set)

         +===================================+
         |...file comment, zero-terminated...| (more-->)
         +===================================+

      (if FLG.FHCRC set)

         +---+---+
         | CRC16 |
         +---+---+

         +=======================+
         |...compressed blocks...| (more-->)
         +=======================+

           0   1   2   3   4   5   6   7
         +---+---+---+---+---+---+---+---+
         |     CRC32     |     ISIZE     |
         +---+---+---+---+---+---+---+---+
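
      The ten fixed bytes at the start of each member could be decoded
      along the following lines. This is a minimal sketch, not part of
      the specification; the structure gz_fixed_header and the function
      gz_parse_fixed_header are hypothetical names introduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the 10 fixed header bytes. */
struct gz_fixed_header {
    uint8_t  id1, id2;  /* identification bytes */
    uint8_t  cm;        /* compression method */
    uint8_t  flg;       /* flag bits */
    uint32_t mtime;     /* stored least-significant byte first */
    uint8_t  xfl;       /* extra flags */
    uint8_t  os;        /* operating system / file system type */
};

/* Decode buf[0..9] into *h.  Returns 0 on success, or -1 if fewer
   than 10 bytes are available.  No field values are validated here. */
static int gz_parse_fixed_header(const unsigned char *buf, size_t len,
                                 struct gz_fixed_header *h)
{
    if (len < 10)
        return -1;
    h->id1 = buf[0];
    h->id2 = buf[1];
    h->cm  = buf[2];
    h->flg = buf[3];
    h->mtime = (uint32_t)buf[4]
             | ((uint32_t)buf[5] << 8)
             | ((uint32_t)buf[6] << 16)
             | ((uint32_t)buf[7] << 24);
    h->xfl = buf[8];
    h->os  = buf[9];
    return 0;
}
```

      The fields are copied one byte at a time rather than by
      overlaying a C structure on the buffer, since structure padding
      and host byte order are not under the control of this format.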

      2.3.1. Member header and trailer

         ID1 (IDentification 1)
         ID2 (IDentification 2)
            These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
            (0x8b, \213), to identify the file as being in gzip format.

         CM (Compression Method)
            This identifies the compression method used in the file. CM
            = 0-7 are reserved. CM = 8 denotes the "deflate"
            compression method, which is the one customarily used by
            gzip and which is documented elsewhere.

         FLG (FLaGs)
            This flag byte is divided into individual bits as follows:

               bit 0   FTEXT
               bit 1   FHCRC
               bit 2   FEXTRA
               bit 3   FNAME
               bit 4   FCOMMENT
               bit 5   reserved
               bit 6   reserved
               bit 7   reserved

            If FTEXT is set, the file is probably ASCII text. This is
            an optional indication, which the compressor may set by
            checking a small amount of the input data to see whether any
            non-ASCII characters are present. In case of doubt, FTEXT
            is cleared, indicating binary data. For systems which have
            different file formats for ASCII text and binary data, the
            decompressor can use FTEXT to choose the appropriate format.
            We deliberately do not specify the algorithm used to set
            this bit, since a compressor always has the option of
            leaving it cleared and a decompressor always has the option
            of ignoring it and letting some other program handle issues
            of data conversion.

            If FHCRC is set, a CRC16 for the gzip header is present,
            immediately before the compressed data. The CRC16 consists
            of the two least significant bytes of the CRC32 for all
            bytes of the gzip header up to and not including the CRC16.
            [The FHCRC bit was never set by versions of gzip up to
            1.2.4, even though it was documented with a different
            meaning in gzip 1.2.4.]

            If FEXTRA is set, optional extra fields are present, as
            described in a following section.

            If FNAME is set, an original file name is present,
            terminated by a zero byte. The name must consist of ISO
            8859-1 (LATIN-1) characters; on operating systems using
            EBCDIC or any other character set for file names, the name
            must be translated to the ISO LATIN-1 character set. This
            is the original name of the file being compressed, with any
            directory components removed, and, if the file being
            compressed is on a file system with case insensitive names,
            forced to lower case. There is no original file name if the
            data was compressed from a source other than a named file;
            for example, if the source was stdin on a Unix system, there
            is no file name.

            If FCOMMENT is set, a zero-terminated file comment is
            present. This comment is not interpreted; it is only
            intended for human consumption. The comment must consist of
            ISO 8859-1 (LATIN-1) characters. Line breaks should be
            denoted by a single line feed character (10 decimal).

            Reserved FLG bits must be zero.
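
            The bit assignments above can be expressed as masks. The
            following is only a sketch; the GZ_* constant names and the
            two helper functions are hypothetical and do not come from
            gzip or any other existing library.

```c
/* Bit masks for the FLG byte (bit 0 is the least-significant bit). */
enum {
    GZ_FTEXT     = 0x01,  /* bit 0 */
    GZ_FHCRC     = 0x02,  /* bit 1 */
    GZ_FEXTRA    = 0x04,  /* bit 2 */
    GZ_FNAME     = 0x08,  /* bit 3 */
    GZ_FCOMMENT  = 0x10,  /* bit 4 */
    GZ_FRESERVED = 0xe0   /* bits 5-7, must be zero */
};

/* Nonzero if no reserved bit is set in flg. */
static int gz_flags_ok(unsigned flg)
{
    return (flg & GZ_FRESERVED) == 0;
}

/* Nonzero if the flag selected by mask is set in flg. */
static int gz_flag_set(unsigned flg, unsigned mask)
{
    return (flg & mask) != 0;
}
```

            For example, a FLG byte of 0x08 has only FNAME set, while
            a FLG byte of 0x20 has a reserved bit set and would be
            rejected by gz_flags_ok.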

         MTIME (Modification TIME)
            This gives the most recent modification time of the original
            file being compressed. The time is in Unix format, i.e.,
            seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
            may cause problems for MS-DOS and other systems that use
            local rather than Universal time.) If the compressed data
            did not come from a file, MTIME is set to the time at which
            compression started. MTIME = 0 means no time stamp is
            available.
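
            A decoded MTIME value might be rendered for display as
            sketched below. The function name gz_format_mtime is
            hypothetical, and the sketch assumes that the host's time_t
            counts seconds since 00:00:00 GMT, Jan. 1, 1970, which is
            true on Unix-like systems but is not guaranteed by ANSI C.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Format mtime as "YYYY-MM-DD HH:MM:SS UTC" into out[0..outlen-1].
   Returns 0 on success, or -1 if mtime is 0 (no time stamp available),
   the conversion fails, or the buffer is too small.
   Note: gmtime() uses static storage and is not thread-safe. */
static int gz_format_mtime(uint32_t mtime, char *out, size_t outlen)
{
    time_t t = (time_t)mtime;   /* assumes a Unix-epoch time_t */
    struct tm *tm;

    if (mtime == 0)
        return -1;
    tm = gmtime(&t);
    if (tm == NULL)
        return -1;
    return strftime(out, outlen, "%Y-%m-%d %H:%M:%S UTC", tm) ? 0 : -1;
}
```

            Universal time is used for display here so that the result
            does not depend on the local time zone of the machine doing
            the decompression.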

         XFL (eXtra FLags)
            These flags are available for use by specific compression
            methods. The "deflate" method (CM = 8) sets these flags as
            follows:

               XFL = 2 - compressor used maximum compression,
                         slowest algorithm
               XFL = 4 - compressor used fastest algorithm

         OS (Operating System)
            This identifies the type of file system on which compression
            took place. This may be useful in determining end-of-line
            convention for text files. The currently defined values are
            as follows:

                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
                 1 - Amiga
                 2 - VMS (or OpenVMS)
                 3 - Unix
                 4 - VM/CMS
                 5 - Atari TOS
                 6 - HPFS filesystem (OS/2, NT)
                 7 - Macintosh
                 8 - Z-System
                 9 - CP/M
                10 - TOPS-20
                11 - NTFS filesystem (NT)
                12 - QDOS
                13 - Acorn RISCOS
               255 - unknown
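
            For diagnostic output, the OS values listed above could be
            mapped to text with a small lookup table. This is a sketch
            only; gz_os_name is a hypothetical name, and the string
            "undefined" for values 14 through 254 is a choice made for
            this example, since the specification assigns them no
            meaning.

```c
#include <stddef.h>

/* Return a description of the OS byte using the values defined in
   this specification. */
static const char *gz_os_name(unsigned os)
{
    static const char *const names[] = {
        "FAT filesystem (MS-DOS, OS/2, NT/Win32)",  /*  0 */
        "Amiga",                                    /*  1 */
        "VMS (or OpenVMS)",                         /*  2 */
        "Unix",                                     /*  3 */
        "VM/CMS",                                   /*  4 */
        "Atari TOS",                                /*  5 */
        "HPFS filesystem (OS/2, NT)",               /*  6 */
        "Macintosh",                                /*  7 */
        "Z-System",                                 /*  8 */
        "CP/M",                                     /*  9 */
        "TOPS-20",                                  /* 10 */
        "NTFS filesystem (NT)",                     /* 11 */
        "QDOS",                                     /* 12 */
        "Acorn RISCOS"                              /* 13 */
    };

    if (os < sizeof names / sizeof names[0])
        return names[os];
    if (os == 255)
        return "unknown";
    return "undefined";   /* not assigned by this specification */
}
```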

         XLEN (eXtra LENgth)
            If FLG.FEXTRA is set, this gives the length of the optional
            extra field. See below for details.

         CRC32 (CRC-32)
            This contains a Cyclic Redundancy Check value of the
            uncompressed data computed according to the CRC-32
            algorithm used in the ISO 3309 standard and in section
            8.1.1.6.2 of ITU-T recommendation V.42. (See
            http://www.iso.ch for ordering ISO documents. See
            gopher://info.itu.ch for an online version of ITU-T V.42.)

         ISIZE (Input SIZE)
            This contains the size of the original (uncompressed) input
            data modulo 2^32.
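
            After decompression, the trailer can be compared with the
            values computed from the output. The sketch below is not
            part of the specification; gz_trailer_matches is a
            hypothetical name, and the caller is assumed to have
            computed the CRC-32 and the total length of the
            uncompressed data itself.

```c
#include <stdint.h>

/* Read a 32-bit value stored least-significant byte first. */
static uint32_t gz_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Compare the 8 trailer bytes t[0..7] (CRC32 then ISIZE) with the
   CRC-32 and total length computed over the uncompressed data.
   Returns nonzero if both fields match. */
static int gz_trailer_matches(const unsigned char t[8],
                              uint32_t crc, uint64_t total_len)
{
    uint32_t stored_crc   = gz_le32(t);
    uint32_t stored_isize = gz_le32(t + 4);

    /* ISIZE holds only the low 32 bits of the length. */
    return stored_crc == crc
        && stored_isize == (uint32_t)(total_len & 0xffffffffu);
}
```

            Because ISIZE is taken modulo 2^32, an input of 4294967301
            bytes (2^32 + 5) is recorded with ISIZE = 5. ISIZE alone
            therefore cannot be relied on to give the true size of
            large inputs.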

         2.3.1.1. Extra field

            If the FLG.FEXTRA bit is set, an "extra field" is present in
            the header, with total length XLEN bytes. It consists of a
            series of subfields, each of the form:

               +---+---+---+---+==================================+
               |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
               +---+---+---+---+==================================+

            SI1 and SI2 provide a subfield ID, typically two ASCII letters
            with some mnemonic value. Jean-Loup Gailly
            <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
            IDs; please send him any subfield ID you wish to use. Subfield
            IDs with SI2 = 0 are reserved for future use. The following
            IDs are currently defined:

               SI1         SI2         Data
               ----------  ----------  ----
               0x41 ('A')  0x70 ('p')  Apollo file type information

            LEN gives the length of the subfield data, excluding the 4
            initial bytes.
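
            A reader that wants one particular subfield could walk the
            extra field as sketched below. This is an illustration
            only; gz_find_subfield is a hypothetical name, and treating
            a subfield that runs past XLEN as an error is a choice made
            for this example rather than a requirement stated above.

```c
#include <stddef.h>

/* Search extra[0..xlen-1] for the subfield with ID (si1, si2).
   On success, stores the subfield data length in *len and returns a
   pointer to the data.  Returns NULL if the subfield is absent or if
   a subfield claims more bytes than remain in the extra field. */
static const unsigned char *gz_find_subfield(const unsigned char *extra,
                                             size_t xlen,
                                             unsigned char si1,
                                             unsigned char si2,
                                             size_t *len)
{
    size_t pos = 0;

    while (xlen - pos >= 4) {                 /* room for SI1 SI2 LEN */
        size_t sublen = (size_t)extra[pos + 2]
                      | ((size_t)extra[pos + 3] << 8);

        if (sublen > xlen - pos - 4)
            return NULL;                      /* runs past XLEN */
        if (extra[pos] == si1 && extra[pos + 1] == si2) {
            *len = sublen;
            return extra + pos + 4;
        }
        pos += 4 + sublen;                    /* skip to next subfield */
    }
    return NULL;
}
```

            Since LEN excludes the 4 initial bytes, each step advances
            by LEN + 4 bytes.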

         2.3.1.2. Compliance

            A compliant compressor must produce files with correct ID1,
            ID2, CM, CRC32, and ISIZE, but may set all the other fields in
            the fixed-length part of the header to default values (255 for
            OS, 0 for all others). The compressor must set all reserved
            bits to zero.
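
            The smallest header a compressor may emit, and the matching
            trailer, could be produced as sketched below. The function
            names are hypothetical. The caller is assumed to supply the
            CRC-32 and the size (modulo 2^32) of the uncompressed data,
            and to write the deflate data between header and trailer.

```c
#include <stdint.h>

/* Store v at p[0..3], least-significant byte first. */
static void gz_put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Fill the 10 fixed header bytes with the permitted default values. */
static void gz_put_min_header(unsigned char out[10])
{
    out[0] = 0x1f;            /* ID1 */
    out[1] = 0x8b;            /* ID2 */
    out[2] = 8;               /* CM = 8, deflate */
    out[3] = 0;               /* FLG: no optional fields, reserved = 0 */
    gz_put_le32(out + 4, 0);  /* MTIME = 0, no time stamp available */
    out[8] = 0;               /* XFL */
    out[9] = 255;             /* OS = unknown */
}

/* Fill the 8 trailer bytes: CRC32 followed by ISIZE. */
static void gz_put_trailer(unsigned char out[8],
                           uint32_t crc32, uint32_t isize)
{
    gz_put_le32(out, crc32);
    gz_put_le32(out + 4, isize);
}
```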

            A compliant decompressor must check ID1, ID2, and CM, and
            provide an error indication if any of these have incorrect
            values. It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
            at least so it can skip over the optional fields if they are
            present. It need not examine any other part of the header or
            trailer; in particular, a decompressor may ignore FTEXT and OS
            and always produce binary output, and still be compliant. A
            compliant decompressor must give an error indication if any
            reserved bit is non-zero, since such a bit could indicate the
            presence of a new field that would cause subsequent data to be
            interpreted incorrectly.
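
            The checks required of a decompressor might be combined
            into one routine that locates the start of the compressed
            blocks. This is a minimal sketch under stated assumptions:
            gz_data_offset and its error codes are hypothetical, the
            whole header is assumed to be in memory, only CM = 8 is
            accepted, and the optional CRC16 is skipped rather than
            verified.

```c
#include <stddef.h>

/* Return the offset of the compressed blocks within buf[0..len-1],
   or a negative value: -1 header truncated, -2 bad ID1/ID2/CM,
   -3 reserved FLG bit set. */
static long gz_data_offset(const unsigned char *buf, size_t len)
{
    size_t pos = 10;
    unsigned flg;

    if (len < 10)
        return -1;
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
        return -2;
    flg = buf[3];
    if (flg & 0xe0)                      /* reserved bits 5-7 */
        return -3;
    if (flg & 0x04) {                    /* FEXTRA: XLEN + extra field */
        size_t xlen;
        if (len - pos < 2)
            return -1;
        xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len - pos < xlen)
            return -1;
        pos += xlen;
    }
    if (flg & 0x08) {                    /* FNAME: zero-terminated */
        while (pos < len && buf[pos] != 0)
            pos++;
        if (pos == len)
            return -1;
        pos++;                           /* skip the terminator */
    }
    if (flg & 0x10) {                    /* FCOMMENT: zero-terminated */
        while (pos < len && buf[pos] != 0)
            pos++;
        if (pos == len)
            return -1;
        pos++;
    }
    if (flg & 0x02) {                    /* FHCRC: 2-byte CRC16 */
        if (len - pos < 2)
            return -1;
        pos += 2;
    }
    return (long)pos;
}
```

            The optional fields are skipped in the order in which they
            appear in the member: extra field, file name, comment, and
            then the header CRC16.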

3. References

   [1] "Information Processing - 8-bit single-byte coded graphic
       character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
       The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
       ASCII. Files defining this character set are available as
       iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/

   [2] ISO 3309

   [3] ITU-T recommendation V.42

   [4] Deutsch, L.P., "DEFLATE Compressed Data Format Specification",
       available in ftp://ftp.uu.net/pub/archiving/zip/doc/

   [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
       ftp://prep.ai.mit.edu/pub/gnu/

   [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
       Look-Up", Communications of the ACM, 31(8), pp.1008-1013.

   [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
       pp.118-133.

   [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
       describing the CRC concept.

4. Security Considerations

   Any data compression method involves the reduction of redundancy in
   the data. Consequently, any corruption of the data is likely to have
   severe effects and be difficult to correct. Uncompressed text, on
   the other hand, will probably still be readable despite the presence
   of some corrupted bytes.

   It is recommended that systems using this data format provide some
   means of validating the integrity of the compressed data, such as by
   setting and checking the CRC-32 check value.

5. Acknowledgements

   Trademarks cited in this document are the property of their
   respective owners.

   Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
   the related software described in this specification. Glenn
   Randers-Pehrson converted this document to RFC and HTML format.

6. Author's Address

   L. Peter Deutsch
   Aladdin Enterprises
   203 Santa Margarita Ave.
   Menlo Park, CA 94025

   Phone: (415) 322-0103 (AM only)
   FAX:   (415) 322-1734
   EMail: <ghost@aladdin.com>

   Questions about the technical content of this specification can be
   sent by email to:

   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
   Mark Adler <madler@alumni.caltech.edu>

   Editorial comments on this specification can be sent by email to:

   L. Peter Deutsch <ghost@aladdin.com> and
   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>

7. Appendix: Jean-Loup Gailly's gzip utility

   The most widely used implementation of gzip compression, and the
   original documentation on which this specification is based, were
   created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this
   implementation is a de facto standard, we mention some more of its
   features here. Again, the material in this section is not part of
   the specification per se, and implementations need not follow it to
   be compliant.

   When compressing or decompressing a file, gzip preserves the
   protection, ownership, and modification time attributes on the local
   file system, since there is no provision for representing protection
   attributes in the gzip file format itself. Since the file format
   includes a modification time, the gzip decompressor provides a
   command line switch that assigns the modification time from the file,
   rather than the local modification time of the compressed input, to
   the decompressed output.

8. Appendix: Sample CRC Code

   The following sample code represents a practical implementation of
   the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
   for a formal specification.)

   The sample code is in the ANSI C programming language. Non-C users
   may find it easier to read with these hints:

      &      Bitwise AND operator.
      ^      Bitwise exclusive-OR operator.
      >>     Bitwise right shift operator. When applied to an
             unsigned quantity, as here, right shift inserts zero
             bit(s) at the left.
      !      Logical NOT operator.
      ++     "n++" increments the variable n.
      0xNNN  0x introduces a hexadecimal (base 16) constant.
             Suffix L indicates a long value (at least 32 bits).

      /* Table of CRCs of all 8-bit messages. */
      unsigned long crc_table[256];

      /* Flag: has the table been computed? Initially false. */
      int crc_table_computed = 0;

      /* Make the table for a fast CRC. */
      void make_crc_table(void)
      {
        unsigned long c;
        int n, k;

        for (n = 0; n < 256; n++) {
          c = (unsigned long) n;
          for (k = 0; k < 8; k++) {
            if (c & 1) {
              c = 0xedb88320L ^ (c >> 1);
            } else {
              c = c >> 1;
            }
          }
          crc_table[n] = c;
        }
        crc_table_computed = 1;
      }

      /*
         Update a running crc with the bytes buf[0..len-1] and return
       the updated crc. The crc should be initialized to zero. Pre- and
       post-conditioning (one's complement) is performed within this
       function so it shouldn't be done by the caller. Usage example:

         unsigned long crc = 0L;

         while (read_buffer(buffer, length) != EOF) {
           crc = update_crc(crc, buffer, length);
         }
         if (crc != original_crc) error();
      */
      unsigned long update_crc(unsigned long crc,
                      unsigned char *buf, int len)
      {
        unsigned long c = crc ^ 0xffffffffL;
        int n;

        if (!crc_table_computed)
          make_crc_table();
        for (n = 0; n < len; n++) {
          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
        }
        return c ^ 0xffffffffL;
      }

      /* Return the CRC of the bytes buf[0..len-1]. */
      unsigned long crc(unsigned char *buf, int len)
      {
        return update_crc(0L, buf, len);
      }
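
   As a cross-check on an implementation, the same CRC can be computed
   one bit at a time without a table, and the header CRC16 described
   under FHCRC can be derived from it. The sketch below is an
   illustration only and is not part of the sample code above; the
   names crc32_bitwise and gz_header_crc16 are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 using the same polynomial constant
   (0xedb88320) and the same pre- and post-conditioning as the
   table-driven sample code, but without a lookup table. */
static uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xffffffffu;              /* pre-conditioning */
    size_t n;
    int k;

    for (n = 0; n < len; n++) {
        c ^= buf[n];
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (0xedb88320u ^ (c >> 1)) : (c >> 1);
    }
    return c ^ 0xffffffffu;                /* post-conditioning */
}

/* Header CRC16: the two least-significant bytes of the CRC-32 of the
   header bytes hdr[0..len-1] that precede the CRC16 field. */
static uint16_t gz_header_crc16(const unsigned char *hdr, size_t len)
{
    return (uint16_t)(crc32_bitwise(hdr, len) & 0xffffu);
}
```

   For the nine ASCII bytes "123456789", this CRC-32 yields the widely
   published check value 0xcbf43926, which is a convenient test vector
   for either version of the code. The bitwise form is slower than the
   table-driven form but needs no initialization.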