source: nikanabo/current/xdelta/diy/draft-korn-vcdiff.txt @ 185

Last change on this file since 185 was 185, checked in by geyser, 14 years ago
File size: 59.3 KB
Line 
1                                                     David G. Korn, AT&T Labs
2                                             Joshua P. MacDonald, UC Berkeley
3                                                 Jeffrey C. Mogul, Compaq WRL
4Internet-Draft                                       Kiem-Phong Vo, AT&T Labs
5Expires: 09 November 2002                                    09 November 2001
6
7
8        The VCDIFF Generic Differencing and Compression Data Format
9
10                         draft-korn-vcdiff-06.txt
11
12
13
14Status of this Memo
15
16    This document is an Internet-Draft and is in full conformance
17    with all provisions of Section 10 of RFC2026.
18
19    Internet-Drafts are working documents of the Internet Engineering
20    Task Force (IETF), its areas, and its working groups.  Note that
21    other groups may also distribute working documents as
22    Internet-Drafts.
23
24    Internet-Drafts are draft documents valid for a maximum of six
25    months and may be updated, replaced, or obsoleted by other
26    documents at any time.  It is inappropriate to use Internet-
27    Drafts as reference material or to cite them other than as
28    "work in progress."
29
30    The list of current Internet-Drafts can be accessed at
31    http://www.ietf.org/ietf/1id-abstracts.txt
32
33    The list of Internet-Draft Shadow Directories can be accessed at
34    http://www.ietf.org/shadow.html.
35
36
37Abstract
38
39    This memo describes a general, efficient and portable data format
40    suitable for encoding compressed and/or differencing data so that
41    they can be easily transported among computers.
42
43
44Table of Contents:
45
46    1.  EXECUTIVE SUMMARY ............................................  2
47    2.  CONVENTIONS ..................................................  3
48    3.  DELTA INSTRUCTIONS ...........................................  4
49    4.  DELTA FILE ORGANIZATION ......................................  5
50    5.  DELTA INSTRUCTION ENCODING ...................................  9
51    6.  DECODING A TARGET WINDOW ..................................... 14
52    7.  APPLICATION-DEFINED CODE TABLES .............................. 16
53    8.  PERFORMANCE .................................................. 16
54    9.  FURTHER ISSUES ............................................... 17
55   10.  SUMMARY ...................................................... 18
56   11.  ACKNOWLEDGEMENTS ............................................. 18
57   12.  SECURITY CONSIDERATIONS ...................................... 18
58   13.  SOURCE CODE AVAILABILITY ..................................... 18
59   14.  INTELLECTUAL PROPERTY RIGHTS ................................. 18
60   15.  IANA CONSIDERATIONS .......................................... 19
61   16.  REFERENCES ................................................... 19
62   17.  AUTHOR'S ADDRESS ............................................. 20
63
64
651.  EXECUTIVE SUMMARY
66
67    Compression and differencing techniques can greatly improve storage
68    and transmission of files and file versions.  Since files are often
69    transported across machines with distinct architectures and performance
70    characteristics, such data should be encoded in a form that is portable
71    and can be decoded with little or no knowledge of the encoders.
72    This document describes Vcdiff, a compact portable encoding format
73    designed for these purposes.
74
75    Data differencing is the process of computing a compact and invertible
76    encoding of a "target file" given a "source file".  Data compression
77    is similar but without the use of source data.  The UNIX utilities diff,
78    compress, and gzip are well-known examples of data differencing and
79    compression tools.  For data differencing, the computed encoding is
80    called a "delta file", and, for data compression, it is called
81    a "compressed file".  Delta and compressed files are good for storage
82    and transmission as they are often smaller than the originals.
83
84    Data differencing and data compression are traditionally treated
85    as distinct types of data processing.  However, as shown in the Vdelta
86    technique by Korn and Vo [1], compression can be thought of as a special
87    case of differencing in which the source data is empty. The basic idea
88    is to unify the string parsing scheme used in the Lempel-Ziv'77 style
89    compressors [2], and the block-move technique of Tichy [3].  Loosely
90    speaking, this works as follows:
91
92        a. Concatenate source and target data.
93        b. Parse the data from left to right as in LZ'77 but
94           make sure that a parsed segment starts the target data.
95        c. Start to output when reaching target data.
96
97    Parsing is based on string matching algorithms such as suffix trees [4]
98    or hashing with different time and space performance characteristics.
99    Vdelta uses a fast string matching algorithm that requires less memory
100    than other techniques [5,6].  However, even with this algorithm, the
101    memory requirement can still be prohibitive for large files.  A common
102    way to deal with memory limitation is to partition an input file into
103    chunks called "windows" and process them separately. Here, except for
104    unpublished work by Vo, little has been done on designing effective
105    windowing schemes. Current techniques, including Vdelta, simply use
106    source and target windows with corresponding addresses across source
107    and target files.
108
109    String matching and windowing algorithms have large influence on the
110    compression rate of delta and compressed files. However, it is desirable
111    to have a portable encoding format that is independent of such algorithms.
112    This enables construction of client-server applications in which a server
113    may serve clients with unknown computing characteristics.  Unfortunately,
114    all current differencing and compressing tools, including Vdelta, fall
115    short in this respect. Their storage formats are closely intertwined
116    with the implemented string matching and/or windowing algorithms.
117
118    The encoding format Vcdiff proposed here addresses the above issues.
119    Vcdiff achieves the below characteristics:
120
121        Output compactness:
122            The basic encoding format compactly represents compressed or delta
123            files. Applications can further extend the basic encoding format
124            with "secondary encoders" to achieve more compression.
125
126        Data portability:
127            The basic encoding format is free from machine byte order and
128            word size issues. This allows data to be encoded on one machine
129            and decoded on a different machine with different architecture.
130
131        Algorithm genericity:
132            The decoding algorithm is independent from string matching and
133            windowing algorithms. This allows competition among implementations
134            of the encoder while keeping the same decoder.
135
136        Decoding efficiency:
137            Except for secondary encoder issues, the decoding algorithm runs
138            in time proportional to the size of the target file and uses space
139            proportional to the maximal window size.  Vcdiff differs from more
140            conventional compressors in that it uses only byte-aligned
141            data, thus avoiding bit-level operations, which improves
142            decoding speed at the slight cost of compression efficiency.
143
144    The Vcdiff data format and the algorithms for decoding data shall be
145    described next.  Since Vcdiff treats compression as a special case of
146    differencing, we shall use the term "delta file" to indicate the
147    compressed output for both cases.
148
149
1502. CONVENTIONS
151
152    The basic data unit is a byte.  For portability, Vcdiff shall limit
153    a byte to its lower eight bits even on machines with larger bytes.
154    The bits in a byte are ordered from right to left so that the least
155    significant bit (LSB) has value 1, and the most significant bit (MSB),
156    has value 128.
157
158    For purposes of exposition in this document, we adopt the convention
159    that the LSB is numbered 0, and the MSB is numbered 7.  Bit numbers
160    never appear in the encoded format itself.
161
162    Vcdiff encodes unsigned integer values using a portable variable-sized
163    format (originally introduced in the Sfio library [7]). This encoding
164    treats an integer as a number in base 128. Then, each digit in this
165    representation is encoded in the lower seven bits of a byte. Except for
166    the least significant byte, other bytes have their most significant bit
167    turned on to indicate that there are still more digits in the encoding.
168    The two key properties of this integer encoding that are beneficial
169    to a data compression format are:
170
171        a. The encoding is portable among systems using 8-bit bytes, and
172        b. Small values are encoded compactly.
173
174    For example, consider the value 123456789 which can be represented with
175    four 7-bit digits whose values are 58, 111, 26, 21 in order from most
176    to least significant. Below is the 8-bit byte encoding of these digits.
177    Note that the MSBs of 58, 111 and 26 are on.
178
179                 +-------------------------------------------+
180                 | 10111010 | 11101111 | 10011010 | 00010101 |
181                 +-------------------------------------------+
182                   MSB+58     MSB+111    MSB+26     0+21
183
184
185    Henceforth, the terms "byte" and "integer" will refer to a byte and an
186    unsigned integer as described.
187
188
189    From time to time, algorithms are exhibited to clarify the descriptions
190    of parts of the Vcdiff format. On such occasions, the C language will be
191    used to make precise the algorithms.  The C code shown in this
192    document is meant for clarification only, and is not part of the
193    actual specification of the Vcdiff format.
194
195    In this specification, the key words "MUST", "MUST NOT",
196    "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
197    described in RFC2119 [12].
198
199
2003.  DELTA INSTRUCTIONS
201
202    A large target file is partitioned into non-overlapping sections
203    called "target windows". These target windows are processed separately
204    and sequentially based on their order in the target file.
205
206    A target window T of length t may be compared against some source data
207    segment S of length s. By construction, this source data segment S
208    comes either from the source file, if one is used, or from a part of
209    the target file earlier than T.  In this way, during decoding, S is
210    completely known when T is being decoded.
211
212    The choices of T, t, S and s are made by some window selection algorithm
213    which can greatly affect the size of the encoding. However, as seen later,
214    these choices are encoded so that no knowledge of the window selection
215    algorithm is needed during decoding.
216
217    Assume that S[j] represents the jth byte in S, and T[k] represents
218    the kth byte in T.  Then, for the delta instructions, we treat the data
219    windows S and T as substrings of a superstring U formed by concatenating
220    them like this:
221
222        S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
223
224    The "address" of a byte in S or T is referred to by its location in U.
225    For example, the address of T[k] is s+k.
226
227    The instructions to encode and direct the reconstruction of a target
228    window are called delta instructions. There are three types:
229
230        ADD: This instruction has two arguments, a size x and a sequence of
231            x bytes to be copied.
232        COPY: This instruction has two arguments, a size x and an address p
233            in the string U. The arguments specify the substring of U that
234            must be copied. We shall assert that such a substring must be
235            entirely contained in either S or T.
236        RUN: This instruction has two arguments, a size x and a byte b that
237            will be repeated x times.
238
239    Below are example source and target windows and the delta instructions
240    that encode the target window in terms of the source window.
241
242        a b c d e f g h i j k l m n o p
243        a b c d w x y z e f g h e f g h e f g h e f g h z z z z
244
245        COPY  4, 0
246        ADD   4, w x y z
247        COPY  4, 4
248        COPY 12, 24
249        RUN   4, z
250
251
252    Thus, the first letter 'a' in the target window is at location 16
253    in the superstring. Note that the fourth instruction, "COPY 12, 24",
254    copies data from T itself since address 24 is position 8 in T.
255    This instruction also shows that it is fine to overlap the data to be
256    copied with the data being copied from as long as the latter starts
257    earlier. This enables efficient encoding of periodic sequences,
258    i.e., sequences with regularly repeated subsequences. The RUN instruction
259    is a compact way to encode a sequence repeating the same byte even though
260    such a sequence can be thought of as a periodic sequence with period 1.
261
262    To reconstruct the target window, one simply processes one delta
263    instruction at a time and copy the data either from the source window
264    or the being reconstructed target window based on the type of the
265    instruction and the associated address, if any.
266
267
2684.  DELTA FILE ORGANIZATION
269
270    A Vcdiff delta file starts with a Header section followed by a sequence
271    of Window sections. The Header section includes magic bytes to identify
272    the file type, and information concerning data processing beyond the
273    basic encoding format. The Window sections encode the target windows.
274
275    Below is the overall organization of a delta file. The indented items
276    refine the ones immediately above them. An item in square brackets may
277    or may not be present in the file depending on the information encoded
278    in the Indicator byte above it.
279
280        Header
281            Header1                                  - byte
282            Header2                                  - byte
283            Header3                                  - byte
284            Header4                                  - byte
285            Hdr_Indicator                            - byte
286            [Secondary compressor ID]                - byte
287
288[@@@ Why is compressor ID not an integer? ]
289[@@@ If we aren't defining any secondary compressors yet, then it seems
290that defining the [Secondary compressor ID] and the corresponding
291VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value.  An
292implementation of this specification won't be able to decode a VCDIFF
293encoded with this option if it doesn't know about any secondary
294compressors.  It seems that you should specify the bits related to
295secondary compressors once you have defined the first a secondary
296compressor.  I can imagine a secondary-compressor might want to supply
297extra information, such as a dictionary of some kind, in which case
298this speculative treatment wouldn't go far enough.]
299
300            [Length of code table data]              - integer
301            [Code table data]
302                Size of near cache                   - byte
303                Size of same cache                   - byte
304                Compressed code table data
305        Window1
306            Win_Indicator                            - byte
307            [Source segment size]                    - integer
308            [Source segment position]                - integer
309            The delta encoding of the target window
310                Length of the delta encoding         - integer
311                The delta encoding
312                    Size of the target window        - integer
313                    Delta_Indicator                  - byte
314                    Length of data for ADDs and RUNs - integer
315                    Length of instructions and sizes - integer
316                    Length of addresses for COPYs    - integer
317                    Data section for ADDs and RUNs   - array of bytes
318                    Instructions and sizes section   - array of bytes
319                    Addresses section for COPYs      - array of bytes
320        Window2
321        ...
322
323
324
3254.1 The Header Section
326
327    Each delta file starts with a header section organized as below.
328    Note the convention that square-brackets enclose optional items.
329
330            Header1                                  - byte = 0xE6
331            Header2                                  - byte = 0xD3
332            Header3                                  - byte = 0xD4
333
334HMMM
335
3360xD6
3370xC3
3380xC4
339
340            Header4                                  - byte
341            Hdr_Indicator                            - byte
342            [Secondary compressor ID]                - byte
343            [Length of code table data]              - integer
344            [Code table data]
345
346    The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
347    with their most significant bits turned on (in hexadecimal, the values
348    are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
349    zero. In the future, it might be used to indicate the version of Vcdiff.
350
351    The Hdr_Indicator byte shows if there are any initialization data
352    required to aid in the reconstruction of data in the Window sections.
353    This byte MAY have non-zero values for either, both, or neither of
354    the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
355
356            7 6 5 4 3 2 1 0
357           +-+-+-+-+-+-+-+-+
358           | | | | | | | | |
359           +-+-+-+-+-+-+-+-+
360                        ^ ^
361                        | |
362                        | +-- VCD_DECOMPRESS
363                        +---- VCD_CODETABLE
364
365    If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
366    compressor may have been used to further compress certain parts of the
367    delta encoding data as described in Sections 4.3 and 6. In that case,
368    the ID of the secondary compressor is given next. If this bit is zero,
369    the compressor ID byte is not included.
370
371[@@@ If we aren't defining any secondary compressors yet, then it seems
372this bit has no real value yet..]
373
374    If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
375    application-defined code table is to be used for decoding the delta
376    instructions. This table itself is compressed.  The length of the data
377    comprising this compressed code table and the data follow next. Section 7
378    discusses application-defined code tables.  If this bit is zero, the code
379    table data length and the code table data are not included.
380
381    If both bits are set, then the compressor ID byte is included
382    before the code table data length and the code table data.
383
384
3854.2 The Format of a Window Section
386
387    Each Window section is organized as follows:
388
389            Win_Indicator                            - byte
390            [Source segment length]                  - integer
391            [Source segment position]                - integer
392            The delta encoding of the target window
393
394
395    Below are the detail of the various items:
396
397[@@@ Here, I want to replace the Win_Indicator with a source-count,
398followed by source-count length/position pairs?]
399
400        Win_Indicator:
401            This byte is a set of bits, as shown:
402
403            7 6 5 4 3 2 1 0
404           +-+-+-+-+-+-+-+-+
405           | | | | | | | | |
406           +-+-+-+-+-+-+-+-+
407                        ^ ^
408                        | |
409                        | +-- VCD_SOURCE
410                        +---- VCD_TARGET
411
412
413            If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
414            of data from the "source" file was used as the corresponding
415            source window of data to encode the target window. The decoder
416            will use this same source data segment to decode the target window.
417
418            If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
419            of data from the "target" file was used as the corresponding
420            source window of data to encode the target window. As above, this
421            same source data segment is used to decode the target window.
422
423            The Win_Indicator byte MUST NOT have more than one of the bits
424            set (non-zero).  It MAY have none of these bits set.
425
426            If one of these bits is set, the byte is followed by two
427            integers to indicate respectively the length and position of
428            the source data segment in the relevant file.  If the
429            indicator byte is zero, the target window was compressed
430            by itself without comparing against another data segment,
431            and these two integers are not included.
432
433        The delta encoding of the target window:
434            This contains the delta encoding of the target window either
435            in terms of the source data segment (i.e., VCD_SOURCE
436            or VCD_TARGET was set) or by itself if no source window
437            is specified. This data format is discussed next.
438
439
4404.3 The Delta Encoding of a Target Window
441
442    The delta encoding of a target window is organized as follows:
443
444        Length of the delta encoding            - integer
445        The delta encoding
446            Length of the target window         - integer
447            Delta_Indicator                     - byte
448            Length of data for ADDs and RUNs    - integer
449            Length of instructions section      - integer
450            Length of addresses for COPYs       - integer
451            Data section for ADDs and RUNs      - array of bytes
452            Instructions and sizes section      - array of bytes
453            Addresses section for COPYs         - array of bytes
454
455
456        Length of the delta encoding:
457            This integer gives the total number of remaining bytes that
458            comprise data of the delta encoding for this target window.
459
460        The delta encoding:
461            This contains the data representing the delta encoding which
462            is described next.
463
464        Length of the target window:
465            This integer indicates the actual size of the target window
466            after decompression. A decoder can use this value to allocate
467            memory to store the uncompressed data.
468
469        Delta_Indicator:
470            This byte is a set of bits, as shown:
471
472            7 6 5 4 3 2 1 0
473           +-+-+-+-+-+-+-+-+
474           | | | | | | | | |
475           +-+-+-+-+-+-+-+-+
476                      ^ ^ ^
477                      | | |
478                      | | +-- VCD_DATACOMP
479                      | +---- VCD_INSTCOMP
480                      +------ VCD_ADDRCOMP
481
482                VCD_DATACOMP:   bit value 1.
483                VCD_INSTCOMP:   bit value 2.
484                VCD_ADDRCOMP:   bit value 4.
485
486            As discussed, the delta encoding consists of COPY, ADD and RUN
487            instructions. The ADD and RUN instructions have accompanying
488            unmatched data (that is, data that does not specifically match
489            any data in the source window or in some earlier part of the
490            target window) and the COPY instructions have addresses of where
491            the matches occur. OPTIONALLY, these types of data MAY be further
492            compressed using a secondary compressor. Thus, Vcdiff separates
493            the encoding of the delta instructions into three parts:
494
495                a. The unmatched data in the ADD and RUN instructions,
496                b. The delta instructions and accompanying sizes, and
497                c. The addresses of the COPY instructions.
498
499            If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
500            sections may have been compressed using the specified secondary
501            compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
502            and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
503            the corresponding parts are compressed. Then, these parts MUST
504            be decompressed before decoding the delta instructions.
505
506        Length of data for ADDs and RUNs:
507            This is the length (in bytes) of the section of data storing
508            the unmatched data accompanying the ADD and RUN instructions.
509
510        Length of instructions section:
511            This is the length (in bytes) of the delta instructions and
512            accompanying sizes.
513
514        Length of addresses for COPYs:
515            This is the length (in bytes) of the section storing
516            the addresses of the COPY instructions.
517
518        Data section for ADDs and RUNs:
519            This sequence of bytes encodes the unmatched data for the ADD
520            and RUN instructions.
521
522        Instructions and sizes section:
523            This sequence of bytes encodes the instructions and their sizes.
524
525        Addresses section for COPYs:
526            This sequence of bytes encodes the addresses of the COPY
527            instructions.
528
529
5305. DELTA INSTRUCTION ENCODING
531
532    The delta instructions described in Section 3 represent the results of
533    string matching. For many data differencing applications in which the
534    changes between source and target data are small, any straightforward
535    representation of these instructions would be adequate.  However, for
536    applications including data compression, it is important to encode
537    these instructions well to achieve good compression rates.  From our
538    experience, the following observations can be made:
539
540    a. The addresses in COPY instructions are locations of matches and
541       often occur close by or even exactly equal to one another. This is
542       because data in local regions are often replicated with minor changes.
543       In turn, this means that coding a newly matched address against some
544       set of recently matched addresses can be beneficial.
545
546    b. The matches are often short in length and separated by small amounts
547       of unmatched data. That is, the lengths of COPY and ADD instructions
548       are often small. This is particularly true of binary data such as
549       executable files or structured data such as HTML or XML. In such cases,
550       compression can be improved by combining the encoding of the sizes
551       and the instruction types as well as combining the encoding of adjacent
552       delta instructions with sufficiently small data sizes.
553
554    The below subsections discuss how the Vcdiff data format provides
555    mechanisms enabling encoders to use the above observations to improve
556    compression rates.
557
558
5595.1 Address Encoding Modes of COPY Instructions
560
561    As mentioned earlier, addresses of COPY instructions often occur close
562    to one another or are exactly equal. To take advantage of this phenomenon
563    and encode addresses of COPY instructions more efficiently, the Vcdiff
564    data format supports the use of two different types of address caches.
565    Both the encoder and decoder maintain these caches, so that decoder's
566    caches remain synchronized with the encoder's caches.
567
568    a. A "near" cache is an array with "s_near" slots, each containing an
569       address used for encoding addresses nearby to previously encoded
570       addresses (in the positive direction only).  The near cache also
571       maintains a "next_slot" index to the near cache.  New entries to the
572       near cache are always inserted in the next_slot index, which maintains
573       a circular buffer of the s_near most recent addresses.
574
575    b. A "same" cache is an array with "s_same" multiple of 256 slots, each
576       containing an address.  The same cache maintains a hash table of recent
577       addresses used for repeated encoding of the exact same address.
578
579
580    By default, the parameters s_near and s_same are respectively set to 4
581    and 3. An encoder MAY modify these values, but then it MUST encode the
582    new values in the encoding itself, as discussed in Section 7, so that
583    the decoder can properly set up its own caches.
584
585    At the start of processing a target window, an implementation
586    (encoder or decoder) initializes all of the slots in both caches
587    to zero.  The next_slot pointer of the near cache is set
588    to point to slot zero.
589
590    Each time a COPY instruction is processed by the encoder or
591    decoder, the implementation's caches are updated as follows, where
592    "addr" is the address in the COPY instruction.
593
594    a. The slot in the near cache referenced by the next_slot
595       index is set to addr.  The next_slot index is then incremented
596       modulo s_near.
597
598    b. The slot in the same cache whose index is addr%(s_same*256)
599       is set to addr. [We use the C notations of % for modulo and
600       * for multiplication.]
601
602
6035.2 Example code for maintaining caches
604
605    To make clear the above description, below are example cache data
606    structures and algorithms to initialize and update them:
607
608        typedef struct _cache_s
609        {
610            int*  near;      /* array of size s_near        */
611            int   s_near;
612            int   next_slot; /* the circular index for near */
613            int*  same;      /* array of size s_same*256    */
614            int   s_same;
615        } Cache_t;
616
617        cache_init(Cache_t* ka)
618        {
619            int   i;
620
621            ka->next_slot = 0;
622            for(i = 0; i < ka->s_near; ++i)
623                 ka->near[i] = 0;
624
625            for(i = 0; i < ka->s_same*256; ++i)
626                 ka->same[i] = 0;
627        }
628
629        cache_update(Cache_t* ka, int addr)
630        {
631            if(ka->s_near > 0)
632            {   ka->near[ka->next_slot] = addr;
633                ka->next_slot = (ka->next_slot + 1) % ka->s_near;
634            }
635
636            if(ka->s_same > 0)
637                ka->same[addr % (ka->s_same*256)] = addr;
638        }
639
640
6415.3 Encoding of COPY instruction addresses
642
643    The address of a COPY instruction is encoded using different modes
644    depending on the type of cached address used, if any.
645
646    Let "addr" be the address of a COPY instruction to be decoded and "here"
647    be the current location in the target data (i.e., the start of the data
648    about to be encoded or decoded).  Let near[j] be the jth element in
649    the near cache, and same[k] be the kth element in the same cache.
650    Below are the possible address modes:
651
652        VCD_SELF: This mode has value 0. The address was encoded by itself
653            as an integer.
654
655        VCD_HERE: This mode has value 1. The address was encoded as
656            the integer value "here - addr".
657
658        Near modes: The "near modes" are in the range [2,s_near+1]. Let m
659            be the mode of the address encoding. The address was encoded
660            as the integer value "addr - near[m-2]".
661
662        Same modes: The "same modes" are in the range
663            [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
664            The address was encoded as a single byte b such that
665            "addr == same[(m - (s_near+2))*256 + b]".
666
667
6685.3 Example code for encoding and decoding of COPY instruction addresses
669
670    We show example algorithms below to demonstrate use of address modes more
671    clearly. The encoder has freedom to choose address modes, the sample
672    addr_encode() algorithm merely shows one way of picking the address
673    mode. The decoding algorithm addr_decode() will uniquely decode addresses
674    regardless of the encoder's algorithm choice.
675
676    Note that the address caches are updated immediately after an address is
677    encoded or decoded. In this way, the decoder is always synchronized with
678    the encoder.
679
680        int addr_encode(Cache_t* ka, int addr, int here, int* mode)
681        {
682            int  i, d, bestd, bestm;
683
684            /* Attempt to find the address mode that yields the
685             * smallest integer value for "d", the encoded address
686             * value, thereby minimizing the encoded size of the
687             * address. */
688
689            bestd = addr; bestm = VCD_SELF;      /* VCD_SELF == 0 */
690
691            if((d = here-addr) < bestd)
692                { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
693
694            for(i = 0; i < ka->s_near; ++i)
695                if((d = addr - ka->near[i]) >= 0 && d < bestd)
696                    { bestd = d; bestm = i+2; }
697
698            if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
699                { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
700
701            cache_update(ka,addr);
702
703            *mode = bestm; /* this returns the address encoding mode */
704            return  bestd; /* this returns the encoded address       */
705        }
706
707    Note that the addr_encode() algorithm chooses the best address mode using a
708    local optimization, but that may not lead to the best encoding efficiency
709    because different modes lead to different instruction encodings, as    described below.
710
711    The functions addrint() and addrbyte() used in addr_decode() obtain from
712    the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
713    respectively. These utilities will not be described here.  We simply
714    recall that an integer is represented as a compact variable-sized string
715    of bytes as described in Section 2 (i.e., base 128).
716
717        int addr_decode(Cache_t* ka, int here, int mode)
718        {   int  addr, m;
719
720            if(mode == VCD_SELF)
721                 addr = addrint();
722            else if(mode == VCD_HERE)
723                 addr = here - addrint();
724            else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
725                 addr = ka->near[m] + addrint();
726            else /* same cache */
727            {    m = mode - (2 + ka->s_near);
728                 addr = ka->same[m*256 + addrbyte()];
729            }
730
731            cache_update(ka, addr);
732
733            return addr;
734        }
735
736
7375.4 Instruction Codes
738
739    As noted, the data sizes associated with delta instructions are often
740    small. Thus, compression efficiency can be improved by combining the sizes
741    and instruction types in a single encoding, as well by combining certain
742    pairs of adjacent delta instructions. Effective choices of when to perform
743    such combinations depend on many factors including the data being processed
744    and the string matching algorithm in use. For example, if many COPY
745    instructions have the same data sizes, it may be worth to encode these
746    instructions more compactly than others.
747
748    The Vcdiff data format is designed so that a decoder does not need to be
749    aware of the choices made in encoding algorithms. This is achieved with the
750    notion of an "instruction code table" containing 256 entries. Each entry
751    defines either a single delta instruction or a pair of instructions that
752    have been combined.  Note that the code table itself only exists in main
753    memory, not in the delta file (unless using an application-defined code
754    table, described in Section 7). The encoded data simply includes the index
755    of each instruction and, since there are only 256 indices, each index
756    can be represented as a single byte.
757
758    Each instruction code entry contains six fields, each of which
759    is a single byte with unsigned value:
760
761            +-----------------------------------------------+
762            | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
763            +-----------------------------------------------+
764
765@@@ could be more compact
766
767    Each triple (inst,size,mode) defines a delta instruction. The meanings
768    of these fields are as follows:
769
770    inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
771        RUN (2) or COPY (3) to indicate the instruction types. NOOP means
772        that no instruction is specified. In this case, both the corresponding
773        size and mode fields will be zero.
774
775    size: A "size" field is zero or positive. A value zero means that the
776        size associated with the instruction is encoded separately as
777        an integer in the "Instructions and sizes section" (Section 6).
778        A positive value for "size" defines the actual data size.
779        Note that since the size is restricted to a byte, the maximum
780        value for any instruction with size implicitly defined in the code
781        table is 255.
782
783    mode: A "mode" field is significant only when the associated delta
784        instruction is a COPY. It defines the mode used to encode the
785        associated addresses. For other instructions, this is always zero.
786
787
7885.5 The Code Table
789
790    Following the discussions on address modes and instruction code tables,
791    we define a "Code Table" to have the data below:
792
793        s_near: the size of the near cache,
794        s_same: the size of the same cache,
795        i_code: the 256-entry instruction code table.
796
797    Vcdiff itself defines a "default code table" in which s_near is 4
798    and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
799    The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
800    are for addresses coded against the near cache. And, modes 6, 7  and 8
801    are for addresses coded against the same cache.
802
803    The default instruction code table is depicted below, in a compact
804    representation that we use only for descriptive purposes.  See section 7
805    for the specification of how an instruction code table is represented
806    in the Vcdiff encoding format.  In the depiction, a zero value for
807    size indicates that the size is separately coded. The mode of non-COPY
808    instructions is represented as 0 even though they are not used.
809
810
811         TYPE      SIZE     MODE    TYPE     SIZE     MODE     INDEX
812        ---------------------------------------------------------------
813     1.  RUN         0        0     NOOP       0        0        0
814     2.  ADD    0, [1,17]     0     NOOP       0        0      [1,18]
815     3.  COPY   0, [4,18]     0     NOOP       0        0     [19,34]
816     4.  COPY   0, [4,18]     1     NOOP       0        0     [35,50]
817     5.  COPY   0, [4,18]     2     NOOP       0        0     [51,66]
818     6.  COPY   0, [4,18]     3     NOOP       0        0     [67,82]
819     7.  COPY   0, [4,18]     4     NOOP       0        0     [83,98]
820     8.  COPY   0, [4,18]     5     NOOP       0        0     [99,114]
821     9.  COPY   0, [4,18]     6     NOOP       0        0    [115,130]
822    10.  COPY   0, [4,18]     7     NOOP       0        0    [131,146]
823    11.  COPY   0, [4,18]     8     NOOP       0        0    [147,162]
824    12.  ADD       [1,4]      0     COPY     [4,6]      0    [163,174]
825    13.  ADD       [1,4]      0     COPY     [4,6]      1    [175,186]
826    14.  ADD       [1,4]      0     COPY     [4,6]      2    [187,198]
827    15.  ADD       [1,4]      0     COPY     [4,6]      3    [199,210]
828    16.  ADD       [1,4]      0     COPY     [4,6]      4    [211,222]
829    17.  ADD       [1,4]      0     COPY     [4,6]      5    [223,234]
830    18.  ADD       [1,4]      0     COPY       4        6    [235,238]
831    19.  ADD       [1,4]      0     COPY       4        7    [239,242]
832    20.  ADD       [1,4]      0     COPY       4        8    [243,246]
833    21.  COPY        4      [0,8]   ADD        1        0    [247,255]
834        ---------------------------------------------------------------
835
836    In the above depiction, each numbered line represents one or more
837    entries in the actual instruction code table (recall that an entry in
838    the instruction code table may represent up to two combined delta
839    instructions.) The last column ("INDEX") shows which index value or
840    range of index values of the entries covered by that line. The notation
841    [i,j] means values from i through j, inclusive. The first 6 columns of
842    a line in the depiction describe the pairs of instructions used for
843    the corresponding index value(s).
844
845    If a line in the depiction includes a column entry using the [i,j]
846    notation, this means that the line is instantiated for each value
847    in the range from i to j, inclusive.  The notation "0, [i,j]" means
848    that the line is instantiated for the value 0 and for each value
849    in the range from i to j, inclusive.
850
851    If a line in the depiction includes more than one entry using the [i,j]
852    notation, implying a "nested loop" to convert the line to a range of
853    table entries, the first such [i,j] range specifies the outer loop,
854    and the second specifies the inner loop.
855
856    The below examples should make clear the above description:
857
858    Line 1 shows the single RUN instruction with index 0. As the size field
859    is 0, this RUN instruction always has its actual size encoded separately.
860
861    Line 2 shows the 18 single ADD instructions. The ADD instruction with
862    size field 0 (i.e., the actual size is coded separately) has index 1.
863    ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
864    their sizes are as given (so they will not be separately encoded.)
865
866    Following the single ADD instructions are the single COPY instructions
867    ordered by their address encoding modes. For example, line 11 shows the
868    COPY instructions with mode 8, i.e., the last of the same cache.
869    In this case, the COPY instruction with size field 0 has index 147.
870    Again, the actual size of this instruction will be coded separately.
871
872    Lines 12 to 21 show the pairs of instructions that are combined together.
873    For example, line 12 depicts the 12 entries in which an ADD instruction
874    is combined with an immediately following COPY instruction. The entries
875    with indices 163, 164, 165 represent the pairs in which the ADD
876    instructions all have size 1 while the COPY instructions has mode
877    0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
878
879    The last line, line 21, shows the eight instruction pairs where the first
880    instruction is a COPY and the second is an ADD. In this case, all COPY
881    instructions have size 4 with mode ranging from 0 to 8 and all the ADD
882    instructions have size 1. Thus, the entry with largest index 255
883    combines a COPY instruction of size 4 and mode 8 with an ADD instruction
884    of size 1.
885
886    The choice of the minimum size 4 for COPY instructions in the default code
887    table was made from experiments that showed that excluding small matches
888    (less then 4 bytes long) improved the compression rates.
889
890
8916. DECODING A TARGET WINDOW
892
893    Section 4.3 discusses that the delta instructions and associated data
894    are encoded in three arrays of bytes:
895
896        Data section for ADDs and RUNs,
897        Instructions and sizes section, and
898        Addresses section for COPYs.
899
900
901    Further, these data sections may have been further compressed by some
902    secondary compressor. Assuming that any such compressed data has been
903    decompressed so that we now have three arrays:
904
905        inst: bytes coding the instructions and sizes.
906        data: unmatched data associated with ADDs and RUNs.
907        addr: bytes coding the addresses of COPYs.
908
909    These arrays are organized as follows:
910
911        inst:
912            a sequence of (index, [size1], [size2]) tuples, where "index"
913            is an index into the instruction code table, and size1 and size2
914            are integers that MAY or MAY NOT be included in the tuple as
915            follows. The entry with the given "index" in the instruction
916            code table potentially defines two delta instructions. If the
917            first delta instruction is not a VCD_NOOP and its size is zero,
918            then size1 MUST be present. Otherwise, size1 MUST be omitted and
919            the size of the instruction (if it is not VCD_NOOP) is as defined
920            in the table. The presence or absence of size2 is defined
921            similarly with respect to the second delta instruction.
922
923        data:
924            a sequence of data values, encoded as bytes.
925
926        addr:
927            a sequence of address values. Addresses are normally encoded as
928            integers as described in Section 2 (i.e., base 128).
929            Since the same cache emits addresses in the range [0,255],
930            however, same cache addresses are always encoded as a
931            single byte.
932
933    To summarize, each tuple in the "inst" array includes an index to some
934    entry in the instruction code table that determines:
935
936    a. Whether one or two instructions were encoded and their types.
937
938    b. If the instructions have their sizes encoded separately, these
939       sizes will follow, in order, in the tuple.
940
941    c. If the instructions have accompanying data, i.e., ADDs or RUNs,
942       their data will be in the array "data".
943
944    d. Similarly, if the instructions are COPYs, the coded addresses are
945       found in the array "addr".
946
947    The decoding procedure simply processes the arrays by reading one code
948    index at a time, looking up the corresponding instruction code entry,
949    then consuming the respective sizes, data and addresses following the
950    directions in this entry. In other words, the decoder maintains an implicit
951    next-element pointer for each array; "consuming" an instruction tuple,
952    data, or address value implies incrementing the associated pointer.
953
954    For example, if during the processing of the target window, the next
955    unconsumed tuple in the inst array has index value 19, then the first
956    instruction is a COPY, whose size is found as the immediately following
957    integer in the inst array.  Since the mode of this COPY instruction is
958    VCD_SELF, the corresponding address is found by consuming the next
959    integer in the addr array.  The data array is left intact. As the second
960    instruction for code index 19 is a NOOP, this tuple is finished.
961
962
9637. APPLICATION-DEFINED CODE TABLES
964
965    Although the default code table used in Vcdiff is good for general
966    purpose encoders, there are times when other code tables may perform
967    better. For example, to code a file with many identical segments of data,
968    it may be advantageous to have a COPY instruction with the specific size
969    of these data segments so that the instruction can be encoded in a single
970    byte. Such a special code table MUST then be encoded in the delta file
971    so that the decoder can reconstruct it before decoding the data.
972
973    Vcdiff allows an application-defined code table to be specified
974    in a delta file with the following data:
975
976        Size of near cache            - byte
977        Size of same cache            - byte
978        Compressed code table data
979
980    The "compressed code table data" encodes the delta between the default
981    code table (source) and the new code table (target) in the same manner as
982    described in Section 4.3 for encoding a target window in terms of a
983    source window. This delta is computed using the following steps:
984
985    a.  Convert the new instruction code table into a string, "code", of
986        1536 bytes using the below steps in order:
987
988        i. Add in order the 256 bytes representing the types of the first
989           instructions in the instruction pairs.
990       ii. Add in order the 256 bytes representing the types of the second
991           instructions in the instruction pairs.
992      iii. Add in order the 256 bytes representing the sizes of the first
993           instructions in the instruction pairs.
994       iv. Add in order the 256 bytes representing the sizes of the second
995           instructions in the instruction pairs.
996        v. Add in order the 256 bytes representing the modes of the first
997           instructions in the instruction pairs.
998       vi. Add in order the 256 bytes representing the modes of the second
999           instructions in the instruction pairs.
1000
1001    b.  Similarly, convert the default instruction code table into
1002        a string "dflt".
1003
1004    c.  Treat the string "code" as a target window and "dflt" as the
1005        corresponding source data and apply an encoding algorithm to
1006        compute the delta encoding of "code" in terms of "dflt".
1007        This computation MUST use the default code table for encoding
1008        the delta instructions.
1009
1010    The decoder can then reverse the above steps to decode the compressed
1011    table data using the method of Section 6, employing the default code
1012    table, to generate the new code table.  Note that the decoder does not
1013    need to know anything about the details of the encoding algorithm used
1014    in step (c). The decoder is still able to decode the new code table
1015    because the Vcdiff format is independent from the choice of encoding
1016    algorithm, and because the encoder in step (c) uses the known, default
1017    code table.
1018
1019
10208. PERFORMANCE
1021
1022    The encoding format is compact. For compression only, using the LZ-77
1023    string parsing strategy and without any secondary compressors, the typical
1024    compression rate is better than Unix compress and close to gzip.  For
1025    differencing, the data format is better than all known methods in
1026    terms of its stated goal, which is primarily decoding speed and
1027    encoding efficiency.
1028
1029    We compare the performance of compress, gzip and Vcdiff using the
1030    archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
1031    gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
1032    SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
1033    Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
1034    As string and window matching typically dominates the computation during
1035    compression, the Vcdiff compression times were directly due to the
1036    algorithms used in the Vcodex/Vcdiff software. However, the decompression
1037    times should be generic and representative of any good implementation
1038    of the Vcdiff data format. Timing was done by running each program
1039    three times and taking the average of the total cpu+system times.
1040
1041    Below are the different Vcdiff runs:
1042
1043        Vcdiff: vcdiff is used as compressor only.
1044
1045        Vcdiff-d: vcdiff is used as a differencer only. That is, it only
1046                compares target data against source data.  Since the files
1047                involved are large, they are broken into windows. In this
1048                case, each target window starting at some file offset in
1049                the target file is compared against a source window with
1050                the same file offset (in the source file). The source
1051                window is also slightly larger than the target window
1052                to increase matching opportunities. The -d option also gives
1053                a hint to the string matching algorithm of Vcdiff that
1054                the two files are very similar with long stretches of matches.
1055                The algorithm takes advantage of this to minimize its
1056                processing of source data and save time.
1057
1058        Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
1059                target data against target data as applicable. Thus, vcdiff
1060                both computes differences and compresses data. The windowing
1061                algorithm is the same as above. However, the above hint is
1062                recinded in this case.
1063
1064        Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
1065                uses a content-based heuristic to select source data segments
1066                that are more likely to match with a given target window.
1067                Thus, the source data segment selected for a target window
1068                often will not be aligned with the file offsets of this
1069                target window.
1070
1071
1072                gcc-2.95.1    gcc-2.95.2    compression   decompression
1073    raw size      55746560      55797760
1074    compress         -          19939390       13.85s         7.09s
1075    gzip             -          12973443       42.99s         5.35s
1076    Vcdiff           -          15358786       20.04s         4.65s
1077    Vcdiff-d         -            100971       10.93s         1.92s
1078    Vcdiff-dc        -             97246       20.03s         1.84s
1079    Vcdiff-dcs       -            256445       44.81s         1.84s
1080
1081                TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
1082
1083
1084    TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
1085    sizes of the compressed results. As a pure compressor, the compression
1086    rate for Vcdiff is worse than gzip and better than compress. The last
1087    three rows shows that when two file versions are very similar, differencing
1088    can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
1089    the same simple window selection method but Vcdiff-dc also does compression
1090    so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
1091    data content to search for source data that likely will match a given target
1092    window. Although it does a good job, the heuristic did not always find the
1093    best matches which are given by the simple algorithm of Vcdiff-d.  As a
1094    result, the output size is slightly larger. Note also that there is a large
1095    cost in computing matching windows this way. Finally, the compression times
1096    of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
1097    that the compression feature causes the additional time in Vcdiff-dc
1098    relative to Vcdiff-d.  However, this is not the case. The hint given to
1099    the Vcdiff string matching algorithm that the two files are likely to
1100    have very long stretches of matches helps the algorithm to minimize
1101    processing of the "source data", thus saving half the time. However, as we
1102    shall see below when this hint is wrong, the result is even longer time.
1103
1104
1105                gcc-2.95.2    gcc-2.95.3    compression   decompression
1106    raw size      55797760      55787520
1107    compress         -          19939453       13.54s         7.00s
1108    gzip             -          12998097       42.63s         5.62s
1109    Vcdiff           -          15371737       20.09s         4.74s
1110    Vcdiff-d         -          26383849       71.41s         6.41s
1111    Vcdiff-dc        -          14461203       42.48s         4.82s
1112    Vcdiff-dcs       -           1248543       61.18s         1.99s
1113
1114                TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
1115
1116
1117    TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
1118    the sizes of the compressed results. In this case, the tar file of
1119    gcc-2.95.3 is rearranged in a way that makes the straightforward method
1120    of matching file offsets for source and target windows fail. As a
1121    result, Vcdiff-d performs rather dismally both in time and output size.
1122    The large time for Vcdiff-d is directly due to fact that the string
1123    matching algorithm has to work much harder to find matches when the hint
1124    that two files have long matching stretches fails to hold. On the other
1125    hand, Vcdiff-dc does both differencing and compression resulting in good
1126    output size. Finally, the window searching heuristic used in Vcdiff-dcs is
1127    effective in finding the right matching source windows for target windows
1128    resulting a small output size. This shows why the data format needs to
1129    have a way to specify matching windows to gain performance. Finally,
1130    we note that the decoding times are always good regardless of how
1131    the string matching or window searching algorithms perform.
1132
1133
11349. FURTHER ISSUES
1135
1136    This document does not address a few issues:
1137
1138    Secondary compressors:
1139        As discussed in Section 4.3, certain sections in the delta encoding
1140        of a window may be further compressed by a secondary compressor.
1141        In our experience, the basic Vcdiff format is adequate for most
1142        purposes so that secondary compressors are seldom needed. In
1143        particular, for normal use of data differencing where the files to
1144        be compared have long stretches of matches, much of the gain in
1145        compression rate is already achieved by normal string matching.
1146        Thus, the use of secondary compressors is seldom needed in this case.
1147        However, for applications beyond differencing of such nearly identical
1148        files, secondary compressors may be needed to achieve maximal
1149        compressed results.
1150
1151        Therefore, we recommend to leave the Vcdiff data format defined
1152        as in this document so that the use of secondary compressors
1153        can be implemented when they become needed in the future.
1154        The formats of the compressed data via such compressors or any
1155        compressors that may be defined in the future are left open to
1156        their implementations.  These could include Huffman encoding,
1157        arithmetic encoding, and splay tree encoding [8,9].
1158
1159    Large file system vs. small file system:
1160        As discussed in Section 4, a target window in a large file may be
1161        compared against some source window in another file or in the same
1162        file (from some earlier part). In that case, the file offset of the
1163        source window is specified as a variable-sized integer in the delta
1164        encoding. There is a possibility that the encoding was computed on
1165        a system supporting much larger files than in a system where
1166        the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
1167        systems). In that case, some target data may not be recoverable.
1168        This problem could afflict any compression format, and ought
1169        to be resolved with a generic negotiation mechanism in the
1170        appropriate protocol(s).
1171
1172
117310.  SUMMARY
1174
1175    We have described Vcdiff, a general and portable encoding format for
1176    compression and differencing. The format is good in that it allows
1177    implementing a decoder without knowledge of the encoders. Further,
1178    ignoring the use of secondary compressors not defined within the format,
1179    the decoding algorithms runs in linear time and requires working space
1180    proportional to window sizes.
1181
1182
1183
118411. ACKNOWLEDGEMENTS
1185
1186    Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
1187    who provided much encouragement to publicize Vcdiff. In particular, Jeff
1188    helped clarifying the description of the data format presented here.
1189
1190
1191
119212. SECURITY CONSIDERATIONS
1193
1194    Vcdiff only provides a format to encode compressed and differenced data.
1195    It does not address any issues concerning how such data are, in fact,
1196    stored in a given file system or the run-time memory of a computer system.
1197    Therefore, we do not anticipate any security issues with respect to Vcdiff.
1198
1199
1200
120113. SOURCE CODE AVAILABILITY
1202
1203    Vcdiff is implemented as a data transforming method in Phong Vo's
1204    Vcodex library. AT&T Corp. has made the source code for Vcodex available
1205    for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
1206    The source code and according license is accessible at the below URL:
1207
1208          http://www.research.att.com/sw/tools
1209
1210
121114. INTELLECTUAL PROPERTY RIGHTS
1212
1213   The IETF has been notified of intellectual property rights claimed in
1214   regard to some or all of the specification contained in this
1215   document.  For more information consult the online list of claimed
1216   rights, at <http://www.ietf.org/ipr.html>.
1217
1218   The IETF takes no position regarding the validity or scope of any
1219   intellectual property or other rights that might be claimed to
1220   pertain to the implementation or use of the technology described in
1221   this document or the extent to which any license under such rights
1222   might or might not be available; neither does it represent that it
1223   has made any effort to identify any such rights.  Information on the
1224   IETF's procedures with respect to rights in standards-track and
1225   standards-related documentation can be found in BCP-11.  Copies of
1226   claims of rights made available for publication and any assurances of
1227   licenses to be made available, or the result of an attempt made to
1228   obtain a general license or permission for the use of such
1229   proprietary rights by implementors or users of this specification can
1230   be obtained from the IETF Secretariat.
1231
1232
1233
123415. IANA CONSIDERATIONS
1235
1236   The Internet Assigned Numbers Authority (IANA) administers the number
1237   space for Secondary Compressor ID values.  Values and their meaning
1238   must be documented in an RFC or other peer-reviewed, permanent, and
1239   readily available reference, in sufficient detail so that
1240   interoperability between independent implementations is possible.
1241   Subject to these constraints, name assignments are First Come, First
1242   Served - see RFC2434 [13].  Legal ID values are in the range 1..255.
1243
1244   This document does not define any values in this number space.
1245
1246
124716. REFERENCES
1248
1249    [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
1250        Practical Reusable Unix Software, Editor B. Krishnamurthy,
1251        John Wiley & Sons, Inc., 1995.
1252
1253    [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
1254        Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
1255
1256    [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
1257        ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
1258
1259    [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
1260        Algorithm, Journal of the ACM, 23:262-272, 1976.
1261
1262    [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
1263        IEEE Software Configuration and Maintenance Workshop, 1996.
1264
1265    [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
1266        ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
1267
1268    [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
1269        Proc. of the Summer '91 Usenix Conference, 1991.
1270
1271    [8] D. W. Jones, Application of Splay Trees to Data Compression,
1272        CACM, 31(8):996:1007.
1273
1274    [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
1275        M&T Books, New York, NY, 1995.
1276
1277   [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
1278        Potential benefits of delta encoding and data compression for HTTP,
1279        SIGCOMM '97, Cannes, France, 1997.
1280
1281   [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
1282        Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
1283        IETF, draft-mogul-http-delta-10, 2001.
1284
1285   [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
1286        RFC 2119, March 1997.
1287
1288   [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
1289        Considerations Section in RFCs, RFC2434, October 1998.
1290
1291
1292
129317. AUTHOR'S ADDRESS
1294
1295    Kiem-Phong Vo (main contact)
1296    AT&T Labs, Room D223
1297    180 Park Avenue
1298    Florham Park, NJ 07932
1299    Email: kpv@research.att.com
1300    Phone: 1 973 360 8630
1301
1302    David G. Korn
1303    AT&T Labs, Room D237
1304    180 Park Avenue
1305    Florham Park, NJ 07932
1306    Email: dgk@research.att.com
1307    Phone: 1 973 360 8602
1308
1309    Jeffrey C. Mogul
1310    Western Research Laboratory
1311    Compaq Computer Corporation
1312    250 University Avenue
1313    Palo Alto, California, 94305, U.S.A.
1314    Email: JeffMogul@acm.org
1315    Phone: 1 650 617 3304 (email preferred)
1316
1317    Joshua P. MacDonald
1318    Computer Science Division
1319    University of California, Berkeley
1320    345 Soda Hall
1321    Berkeley, CA 94720
1322    Email: jmacd@cs.berkeley.edu
Note: See TracBrowser for help on using the repository browser.