[185] | 1 | David G. Korn, AT&T Labs
|
---|
| 2 | Joshua P. MacDonald, UC Berkeley
|
---|
| 3 | Jeffrey C. Mogul, Compaq WRL
|
---|
| 4 | Internet-Draft Kiem-Phong Vo, AT&T Labs
|
---|
| 5 | Expires: 09 November 2002 09 November 2001
|
---|
| 6 |
|
---|
| 7 |
|
---|
| 8 | The VCDIFF Generic Differencing and Compression Data Format
|
---|
| 9 |
|
---|
| 10 | draft-korn-vcdiff-06.txt
|
---|
| 11 |
|
---|
| 12 |
|
---|
| 13 |
|
---|
| 14 | Status of this Memo
|
---|
| 15 |
|
---|
| 16 | This document is an Internet-Draft and is in full conformance
|
---|
| 17 | with all provisions of Section 10 of RFC2026.
|
---|
| 18 |
|
---|
| 19 | Internet-Drafts are working documents of the Internet Engineering
|
---|
| 20 | Task Force (IETF), its areas, and its working groups. Note that
|
---|
| 21 | other groups may also distribute working documents as
|
---|
| 22 | Internet-Drafts.
|
---|
| 23 |
|
---|
| 24 | Internet-Drafts are draft documents valid for a maximum of six
|
---|
| 25 | months and may be updated, replaced, or obsoleted by other
|
---|
| 26 | documents at any time. It is inappropriate to use Internet-
|
---|
| 27 | Drafts as reference material or to cite them other than as
|
---|
| 28 | "work in progress."
|
---|
| 29 |
|
---|
| 30 | The list of current Internet-Drafts can be accessed at
|
---|
| 31 | http://www.ietf.org/ietf/1id-abstracts.txt
|
---|
| 32 |
|
---|
| 33 | The list of Internet-Draft Shadow Directories can be accessed at
|
---|
| 34 | http://www.ietf.org/shadow.html.
|
---|
| 35 |
|
---|
| 36 |
|
---|
| 37 | Abstract
|
---|
| 38 |
|
---|
| 39 | This memo describes a general, efficient and portable data format
|
---|
| 40 | suitable for encoding compressed and/or differencing data so that
|
---|
| 41 | they can be easily transported among computers.
|
---|
| 42 |
|
---|
| 43 |
|
---|
| 44 | Table of Contents:
|
---|
| 45 |
|
---|
| 46 | 1. EXECUTIVE SUMMARY ............................................ 2
|
---|
| 47 | 2. CONVENTIONS .................................................. 3
|
---|
| 48 | 3. DELTA INSTRUCTIONS ........................................... 4
|
---|
| 49 | 4. DELTA FILE ORGANIZATION ...................................... 5
|
---|
| 50 | 5. DELTA INSTRUCTION ENCODING ................................... 9
|
---|
| 51 | 6. DECODING A TARGET WINDOW ..................................... 14
|
---|
| 52 | 7. APPLICATION-DEFINED CODE TABLES .............................. 16
|
---|
| 53 | 8. PERFORMANCE .................................................. 16
|
---|
| 54 | 9. FURTHER ISSUES ............................................... 17
|
---|
| 55 | 10. SUMMARY ...................................................... 18
|
---|
| 56 | 11. ACKNOWLEDGEMENTS ............................................. 18
|
---|
| 57 | 12. SECURITY CONSIDERATIONS ...................................... 18
|
---|
| 58 | 13. SOURCE CODE AVAILABILITY ..................................... 18
|
---|
| 59 | 14. INTELLECTUAL PROPERTY RIGHTS ................................. 18
|
---|
| 60 | 15. IANA CONSIDERATIONS .......................................... 19
|
---|
| 61 | 16. REFERENCES ................................................... 19
|
---|
| 62 | 17. AUTHOR'S ADDRESS ............................................. 20
|
---|
| 63 |
|
---|
| 64 |
|
---|
| 65 | 1. EXECUTIVE SUMMARY
|
---|
| 66 |
|
---|
| 67 | Compression and differencing techniques can greatly improve storage
|
---|
| 68 | and transmission of files and file versions. Since files are often
|
---|
| 69 | transported across machines with distinct architectures and performance
|
---|
| 70 | characteristics, such data should be encoded in a form that is portable
|
---|
| 71 | and can be decoded with little or no knowledge of the encoders.
|
---|
| 72 | This document describes Vcdiff, a compact portable encoding format
|
---|
| 73 | designed for these purposes.
|
---|
| 74 |
|
---|
| 75 | Data differencing is the process of computing a compact and invertible
|
---|
| 76 | encoding of a "target file" given a "source file". Data compression
|
---|
| 77 | is similar but without the use of source data. The UNIX utilities diff,
|
---|
| 78 | compress, and gzip are well-known examples of data differencing and
|
---|
| 79 | compression tools. For data differencing, the computed encoding is
|
---|
| 80 | called a "delta file", and, for data compression, it is called
|
---|
| 81 | a "compressed file". Delta and compressed files are good for storage
|
---|
| 82 | and transmission as they are often smaller than the originals.
|
---|
| 83 |
|
---|
| 84 | Data differencing and data compression are traditionally treated
|
---|
| 85 | as distinct types of data processing. However, as shown in the Vdelta
|
---|
| 86 | technique by Korn and Vo [1], compression can be thought of as a special
|
---|
| 87 | case of differencing in which the source data is empty. The basic idea
|
---|
| 88 | is to unify the string parsing scheme used in the Lempel-Ziv'77 style
|
---|
| 89 | compressors [2], and the block-move technique of Tichy [3]. Loosely
|
---|
| 90 | speaking, this works as follows:
|
---|
| 91 |
|
---|
| 92 | a. Concatenate source and target data.
|
---|
| 93 | b. Parse the data from left to right as in LZ'77 but
|
---|
| 94 | make sure that a parsed segment starts the target data.
|
---|
| 95 | c. Start to output when reaching target data.
|
---|
| 96 |
|
---|
| 97 | Parsing is based on string matching algorithms such as suffix trees [4]
|
---|
| 98 | or hashing with different time and space performance characteristics.
|
---|
| 99 | Vdelta uses a fast string matching algorithm that requires less memory
|
---|
| 100 | than other techniques [5,6]. However, even with this algorithm, the
|
---|
| 101 | memory requirement can still be prohibitive for large files. A common
|
---|
| 102 | way to deal with memory limitation is to partition an input file into
|
---|
| 103 | chunks called "windows" and process them separately. Here, except for
|
---|
| 104 | unpublished work by Vo, little has been done on designing effective
|
---|
| 105 | windowing schemes. Current techniques, including Vdelta, simply use
|
---|
| 106 | source and target windows with corresponding addresses across source
|
---|
| 107 | and target files.
|
---|
| 108 |
|
---|
| 109 | String matching and windowing algorithms have large influence on the
|
---|
| 110 | compression rate of delta and compressed files. However, it is desirable
|
---|
| 111 | to have a portable encoding format that is independent of such algorithms.
|
---|
| 112 | This enables construction of client-server applications in which a server
|
---|
| 113 | may serve clients with unknown computing characteristics. Unfortunately,
|
---|
| 114 | all current differencing and compressing tools, including Vdelta, fall
|
---|
| 115 | short in this respect. Their storage formats are closely intertwined
|
---|
| 116 | with the implemented string matching and/or windowing algorithms.
|
---|
| 117 |
|
---|
| 118 | The encoding format Vcdiff proposed here addresses the above issues.
|
---|
| 119 | Vcdiff achieves the below characteristics:
|
---|
| 120 |
|
---|
| 121 | Output compactness:
|
---|
| 122 | The basic encoding format compactly represents compressed or delta
|
---|
| 123 | files. Applications can further extend the basic encoding format
|
---|
| 124 | with "secondary encoders" to achieve more compression.
|
---|
| 125 |
|
---|
| 126 | Data portability:
|
---|
| 127 | The basic encoding format is free from machine byte order and
|
---|
| 128 | word size issues. This allows data to be encoded on one machine
|
---|
| 129 | and decoded on a different machine with different architecture.
|
---|
| 130 |
|
---|
| 131 | Algorithm genericity:
|
---|
| 132 | The decoding algorithm is independent from string matching and
|
---|
| 133 | windowing algorithms. This allows competition among implementations
|
---|
| 134 | of the encoder while keeping the same decoder.
|
---|
| 135 |
|
---|
| 136 | Decoding efficiency:
|
---|
| 137 | Except for secondary encoder issues, the decoding algorithm runs
|
---|
| 138 | in time proportional to the size of the target file and uses space
|
---|
| 139 | proportional to the maximal window size. Vcdiff differs from more
|
---|
| 140 | conventional compressors in that it uses only byte-aligned
|
---|
| 141 | data, thus avoiding bit-level operations, which improves
|
---|
| 142 | decoding speed at the slight cost of compression efficiency.
|
---|
| 143 |
|
---|
| 144 | The Vcdiff data format and the algorithms for decoding data shall be
|
---|
| 145 | described next. Since Vcdiff treats compression as a special case of
|
---|
| 146 | differencing, we shall use the term "delta file" to indicate the
|
---|
| 147 | compressed output for both cases.
|
---|
| 148 |
|
---|
| 149 |
|
---|
| 150 | 2. CONVENTIONS
|
---|
| 151 |
|
---|
| 152 | The basic data unit is a byte. For portability, Vcdiff shall limit
|
---|
| 153 | a byte to its lower eight bits even on machines with larger bytes.
|
---|
| 154 | The bits in a byte are ordered from right to left so that the least
|
---|
| 155 | significant bit (LSB) has value 1, and the most significant bit (MSB),
|
---|
| 156 | has value 128.
|
---|
| 157 |
|
---|
| 158 | For purposes of exposition in this document, we adopt the convention
|
---|
| 159 | that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers
|
---|
| 160 | never appear in the encoded format itself.
|
---|
| 161 |
|
---|
| 162 | Vcdiff encodes unsigned integer values using a portable variable-sized
|
---|
| 163 | format (originally introduced in the Sfio library [7]). This encoding
|
---|
| 164 | treats an integer as a number in base 128. Then, each digit in this
|
---|
| 165 | representation is encoded in the lower seven bits of a byte. Except for
|
---|
| 166 | the least significant byte, other bytes have their most significant bit
|
---|
| 167 | turned on to indicate that there are still more digits in the encoding.
|
---|
| 168 | The two key properties of this integer encoding that are beneficial
|
---|
| 169 | to a data compression format are:
|
---|
| 170 |
|
---|
| 171 | a. The encoding is portable among systems using 8-bit bytes, and
|
---|
| 172 | b. Small values are encoded compactly.
|
---|
| 173 |
|
---|
| 174 | For example, consider the value 123456789 which can be represented with
|
---|
| 175 | four 7-bit digits whose values are 58, 111, 26, 21 in order from most
|
---|
| 176 | to least significant. Below is the 8-bit byte encoding of these digits.
|
---|
| 177 | Note that the MSBs of 58, 111 and 26 are on.
|
---|
| 178 |
|
---|
| 179 | +-------------------------------------------+
|
---|
| 180 | | 10111010 | 11101111 | 10011010 | 00010101 |
|
---|
| 181 | +-------------------------------------------+
|
---|
| 182 | MSB+58 MSB+111 MSB+26 0+21
|
---|
| 183 |
|
---|
| 184 |
|
---|
| 185 | Henceforth, the terms "byte" and "integer" will refer to a byte and an
|
---|
| 186 | unsigned integer as described.
|
---|
| 187 |
|
---|
| 188 |
|
---|
| 189 | From time to time, algorithms are exhibited to clarify the descriptions
|
---|
| 190 | of parts of the Vcdiff format. On such occasions, the C language will be
|
---|
| 191 | used to make precise the algorithms. The C code shown in this
|
---|
| 192 | document is meant for clarification only, and is not part of the
|
---|
| 193 | actual specification of the Vcdiff format.
|
---|
| 194 |
|
---|
| 195 | In this specification, the key words "MUST", "MUST NOT",
|
---|
| 196 | "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
|
---|
| 197 | described in RFC2119 [12].
|
---|
| 198 |
|
---|
| 199 |
|
---|
| 200 | 3. DELTA INSTRUCTIONS
|
---|
| 201 |
|
---|
| 202 | A large target file is partitioned into non-overlapping sections
|
---|
| 203 | called "target windows". These target windows are processed separately
|
---|
| 204 | and sequentially based on their order in the target file.
|
---|
| 205 |
|
---|
| 206 | A target window T of length t may be compared against some source data
|
---|
| 207 | segment S of length s. By construction, this source data segment S
|
---|
| 208 | comes either from the source file, if one is used, or from a part of
|
---|
| 209 | the target file earlier than T. In this way, during decoding, S is
|
---|
| 210 | completely known when T is being decoded.
|
---|
| 211 |
|
---|
| 212 | The choices of T, t, S and s are made by some window selection algorithm
|
---|
| 213 | which can greatly affect the size of the encoding. However, as seen later,
|
---|
| 214 | these choices are encoded so that no knowledge of the window selection
|
---|
| 215 | algorithm is needed during decoding.
|
---|
| 216 |
|
---|
| 217 | Assume that S[j] represents the jth byte in S, and T[k] represents
|
---|
| 218 | the kth byte in T. Then, for the delta instructions, we treat the data
|
---|
| 219 | windows S and T as substrings of a superstring U formed by concatenating
|
---|
| 220 | them like this:
|
---|
| 221 |
|
---|
| 222 | S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
|
---|
| 223 |
|
---|
| 224 | The "address" of a byte in S or T is referred to by its location in U.
|
---|
| 225 | For example, the address of T[k] is s+k.
|
---|
| 226 |
|
---|
| 227 | The instructions to encode and direct the reconstruction of a target
|
---|
| 228 | window are called delta instructions. There are three types:
|
---|
| 229 |
|
---|
| 230 | ADD: This instruction has two arguments, a size x and a sequence of
|
---|
| 231 | x bytes to be copied.
|
---|
| 232 | COPY: This instruction has two arguments, a size x and an address p
|
---|
| 233 | in the string U. The arguments specify the substring of U that
|
---|
| 234 | must be copied. We shall assert that such a substring must be
|
---|
| 235 | entirely contained in either S or T.
|
---|
| 236 | RUN: This instruction has two arguments, a size x and a byte b that
|
---|
| 237 | will be repeated x times.
|
---|
| 238 |
|
---|
| 239 | Below are example source and target windows and the delta instructions
|
---|
| 240 | that encode the target window in terms of the source window.
|
---|
| 241 |
|
---|
| 242 | a b c d e f g h i j k l m n o p
|
---|
| 243 | a b c d w x y z e f g h e f g h e f g h e f g h z z z z
|
---|
| 244 |
|
---|
| 245 | COPY 4, 0
|
---|
| 246 | ADD 4, w x y z
|
---|
| 247 | COPY 4, 4
|
---|
| 248 | COPY 12, 24
|
---|
| 249 | RUN 4, z
|
---|
| 250 |
|
---|
| 251 |
|
---|
| 252 | Thus, the first letter 'a' in the target window is at location 16
|
---|
| 253 | in the superstring. Note that the fourth instruction, "COPY 12, 24",
|
---|
| 254 | copies data from T itself since address 24 is position 8 in T.
|
---|
| 255 | This instruction also shows that it is fine to overlap the data to be
|
---|
| 256 | copied with the data being copied from as long as the latter starts
|
---|
| 257 | earlier. This enables efficient encoding of periodic sequences,
|
---|
| 258 | i.e., sequences with regularly repeated subsequences. The RUN instruction
|
---|
| 259 | is a compact way to encode a sequence repeating the same byte even though
|
---|
| 260 | such a sequence can be thought of as a periodic sequence with period 1.
|
---|
| 261 |
|
---|
| 262 | To reconstruct the target window, one simply processes one delta
|
---|
| 263 | instruction at a time and copy the data either from the source window
|
---|
| 264 | or the being reconstructed target window based on the type of the
|
---|
| 265 | instruction and the associated address, if any.
|
---|
| 266 |
|
---|
| 267 |
|
---|
| 268 | 4. DELTA FILE ORGANIZATION
|
---|
| 269 |
|
---|
| 270 | A Vcdiff delta file starts with a Header section followed by a sequence
|
---|
| 271 | of Window sections. The Header section includes magic bytes to identify
|
---|
| 272 | the file type, and information concerning data processing beyond the
|
---|
| 273 | basic encoding format. The Window sections encode the target windows.
|
---|
| 274 |
|
---|
| 275 | Below is the overall organization of a delta file. The indented items
|
---|
| 276 | refine the ones immediately above them. An item in square brackets may
|
---|
| 277 | or may not be present in the file depending on the information encoded
|
---|
| 278 | in the Indicator byte above it.
|
---|
| 279 |
|
---|
| 280 | Header
|
---|
| 281 | Header1 - byte
|
---|
| 282 | Header2 - byte
|
---|
| 283 | Header3 - byte
|
---|
| 284 | Header4 - byte
|
---|
| 285 | Hdr_Indicator - byte
|
---|
| 286 | [Secondary compressor ID] - byte
|
---|
| 287 |
|
---|
| 288 | [@@@ Why is compressor ID not an integer? ]
|
---|
| 289 | [@@@ If we aren't defining any secondary compressors yet, then it seems
|
---|
| 290 | that defining the [Secondary compressor ID] and the corresponding
|
---|
| 291 | VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. An
|
---|
| 292 | implementation of this specification won't be able to decode a VCDIFF
|
---|
| 293 | encoded with this option if it doesn't know about any secondary
|
---|
| 294 | compressors. It seems that you should specify the bits related to
|
---|
| 295 | secondary compressors once you have defined the first a secondary
|
---|
| 296 | compressor. I can imagine a secondary-compressor might want to supply
|
---|
| 297 | extra information, such as a dictionary of some kind, in which case
|
---|
| 298 | this speculative treatment wouldn't go far enough.]
|
---|
| 299 |
|
---|
| 300 | [Length of code table data] - integer
|
---|
| 301 | [Code table data]
|
---|
| 302 | Size of near cache - byte
|
---|
| 303 | Size of same cache - byte
|
---|
| 304 | Compressed code table data
|
---|
| 305 | Window1
|
---|
| 306 | Win_Indicator - byte
|
---|
| 307 | [Source segment size] - integer
|
---|
| 308 | [Source segment position] - integer
|
---|
| 309 | The delta encoding of the target window
|
---|
| 310 | Length of the delta encoding - integer
|
---|
| 311 | The delta encoding
|
---|
| 312 | Size of the target window - integer
|
---|
| 313 | Delta_Indicator - byte
|
---|
| 314 | Length of data for ADDs and RUNs - integer
|
---|
| 315 | Length of instructions and sizes - integer
|
---|
| 316 | Length of addresses for COPYs - integer
|
---|
| 317 | Data section for ADDs and RUNs - array of bytes
|
---|
| 318 | Instructions and sizes section - array of bytes
|
---|
| 319 | Addresses section for COPYs - array of bytes
|
---|
| 320 | Window2
|
---|
| 321 | ...
|
---|
| 322 |
|
---|
| 323 |
|
---|
| 324 |
|
---|
| 325 | 4.1 The Header Section
|
---|
| 326 |
|
---|
| 327 | Each delta file starts with a header section organized as below.
|
---|
| 328 | Note the convention that square-brackets enclose optional items.
|
---|
| 329 |
|
---|
| 330 | Header1 - byte = 0xE6
|
---|
| 331 | Header2 - byte = 0xD3
|
---|
| 332 | Header3 - byte = 0xD4
|
---|
| 333 |
|
---|
| 334 | HMMM
|
---|
| 335 |
|
---|
| 336 | 0xD6
|
---|
| 337 | 0xC3
|
---|
| 338 | 0xC4
|
---|
| 339 |
|
---|
| 340 | Header4 - byte
|
---|
| 341 | Hdr_Indicator - byte
|
---|
| 342 | [Secondary compressor ID] - byte
|
---|
| 343 | [Length of code table data] - integer
|
---|
| 344 | [Code table data]
|
---|
| 345 |
|
---|
| 346 | The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
|
---|
| 347 | with their most significant bits turned on (in hexadecimal, the values
|
---|
| 348 | are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
|
---|
| 349 | zero. In the future, it might be used to indicate the version of Vcdiff.
|
---|
| 350 |
|
---|
| 351 | The Hdr_Indicator byte shows if there are any initialization data
|
---|
| 352 | required to aid in the reconstruction of data in the Window sections.
|
---|
| 353 | This byte MAY have non-zero values for either, both, or neither of
|
---|
| 354 | the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
|
---|
| 355 |
|
---|
| 356 | 7 6 5 4 3 2 1 0
|
---|
| 357 | +-+-+-+-+-+-+-+-+
|
---|
| 358 | | | | | | | | | |
|
---|
| 359 | +-+-+-+-+-+-+-+-+
|
---|
| 360 | ^ ^
|
---|
| 361 | | |
|
---|
| 362 | | +-- VCD_DECOMPRESS
|
---|
| 363 | +---- VCD_CODETABLE
|
---|
| 364 |
|
---|
| 365 | If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
|
---|
| 366 | compressor may have been used to further compress certain parts of the
|
---|
| 367 | delta encoding data as described in Sections 4.3 and 6. In that case,
|
---|
| 368 | the ID of the secondary compressor is given next. If this bit is zero,
|
---|
| 369 | the compressor ID byte is not included.
|
---|
| 370 |
|
---|
| 371 | [@@@ If we aren't defining any secondary compressors yet, then it seems
|
---|
| 372 | this bit has no real value yet..]
|
---|
| 373 |
|
---|
| 374 | If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
|
---|
| 375 | application-defined code table is to be used for decoding the delta
|
---|
| 376 | instructions. This table itself is compressed. The length of the data
|
---|
| 377 | comprising this compressed code table and the data follow next. Section 7
|
---|
| 378 | discusses application-defined code tables. If this bit is zero, the code
|
---|
| 379 | table data length and the code table data are not included.
|
---|
| 380 |
|
---|
| 381 | If both bits are set, then the compressor ID byte is included
|
---|
| 382 | before the code table data length and the code table data.
|
---|
| 383 |
|
---|
| 384 |
|
---|
| 385 | 4.2 The Format of a Window Section
|
---|
| 386 |
|
---|
| 387 | Each Window section is organized as follows:
|
---|
| 388 |
|
---|
| 389 | Win_Indicator - byte
|
---|
| 390 | [Source segment length] - integer
|
---|
| 391 | [Source segment position] - integer
|
---|
| 392 | The delta encoding of the target window
|
---|
| 393 |
|
---|
| 394 |
|
---|
| 395 | Below are the detail of the various items:
|
---|
| 396 |
|
---|
| 397 | [@@@ Here, I want to replace the Win_Indicator with a source-count,
|
---|
| 398 | followed by source-count length/position pairs?]
|
---|
| 399 |
|
---|
| 400 | Win_Indicator:
|
---|
| 401 | This byte is a set of bits, as shown:
|
---|
| 402 |
|
---|
| 403 | 7 6 5 4 3 2 1 0
|
---|
| 404 | +-+-+-+-+-+-+-+-+
|
---|
| 405 | | | | | | | | | |
|
---|
| 406 | +-+-+-+-+-+-+-+-+
|
---|
| 407 | ^ ^
|
---|
| 408 | | |
|
---|
| 409 | | +-- VCD_SOURCE
|
---|
| 410 | +---- VCD_TARGET
|
---|
| 411 |
|
---|
| 412 |
|
---|
| 413 | If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
|
---|
| 414 | of data from the "source" file was used as the corresponding
|
---|
| 415 | source window of data to encode the target window. The decoder
|
---|
| 416 | will use this same source data segment to decode the target window.
|
---|
| 417 |
|
---|
| 418 | If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
|
---|
| 419 | of data from the "target" file was used as the corresponding
|
---|
| 420 | source window of data to encode the target window. As above, this
|
---|
| 421 | same source data segment is used to decode the target window.
|
---|
| 422 |
|
---|
| 423 | The Win_Indicator byte MUST NOT have more than one of the bits
|
---|
| 424 | set (non-zero). It MAY have none of these bits set.
|
---|
| 425 |
|
---|
| 426 | If one of these bits is set, the byte is followed by two
|
---|
| 427 | integers to indicate respectively the length and position of
|
---|
| 428 | the source data segment in the relevant file. If the
|
---|
| 429 | indicator byte is zero, the target window was compressed
|
---|
| 430 | by itself without comparing against another data segment,
|
---|
| 431 | and these two integers are not included.
|
---|
| 432 |
|
---|
| 433 | The delta encoding of the target window:
|
---|
| 434 | This contains the delta encoding of the target window either
|
---|
| 435 | in terms of the source data segment (i.e., VCD_SOURCE
|
---|
| 436 | or VCD_TARGET was set) or by itself if no source window
|
---|
| 437 | is specified. This data format is discussed next.
|
---|
| 438 |
|
---|
| 439 |
|
---|
| 440 | 4.3 The Delta Encoding of a Target Window
|
---|
| 441 |
|
---|
| 442 | The delta encoding of a target window is organized as follows:
|
---|
| 443 |
|
---|
| 444 | Length of the delta encoding - integer
|
---|
| 445 | The delta encoding
|
---|
| 446 | Length of the target window - integer
|
---|
| 447 | Delta_Indicator - byte
|
---|
| 448 | Length of data for ADDs and RUNs - integer
|
---|
| 449 | Length of instructions section - integer
|
---|
| 450 | Length of addresses for COPYs - integer
|
---|
| 451 | Data section for ADDs and RUNs - array of bytes
|
---|
| 452 | Instructions and sizes section - array of bytes
|
---|
| 453 | Addresses section for COPYs - array of bytes
|
---|
| 454 |
|
---|
| 455 |
|
---|
| 456 | Length of the delta encoding:
|
---|
| 457 | This integer gives the total number of remaining bytes that
|
---|
| 458 | comprise data of the delta encoding for this target window.
|
---|
| 459 |
|
---|
| 460 | The delta encoding:
|
---|
| 461 | This contains the data representing the delta encoding which
|
---|
| 462 | is described next.
|
---|
| 463 |
|
---|
| 464 | Length of the target window:
|
---|
| 465 | This integer indicates the actual size of the target window
|
---|
| 466 | after decompression. A decoder can use this value to allocate
|
---|
| 467 | memory to store the uncompressed data.
|
---|
| 468 |
|
---|
| 469 | Delta_Indicator:
|
---|
| 470 | This byte is a set of bits, as shown:
|
---|
| 471 |
|
---|
| 472 | 7 6 5 4 3 2 1 0
|
---|
| 473 | +-+-+-+-+-+-+-+-+
|
---|
| 474 | | | | | | | | | |
|
---|
| 475 | +-+-+-+-+-+-+-+-+
|
---|
| 476 | ^ ^ ^
|
---|
| 477 | | | |
|
---|
| 478 | | | +-- VCD_DATACOMP
|
---|
| 479 | | +---- VCD_INSTCOMP
|
---|
| 480 | +------ VCD_ADDRCOMP
|
---|
| 481 |
|
---|
| 482 | VCD_DATACOMP: bit value 1.
|
---|
| 483 | VCD_INSTCOMP: bit value 2.
|
---|
| 484 | VCD_ADDRCOMP: bit value 4.
|
---|
| 485 |
|
---|
| 486 | As discussed, the delta encoding consists of COPY, ADD and RUN
|
---|
| 487 | instructions. The ADD and RUN instructions have accompanying
|
---|
| 488 | unmatched data (that is, data that does not specifically match
|
---|
| 489 | any data in the source window or in some earlier part of the
|
---|
| 490 | target window) and the COPY instructions have addresses of where
|
---|
| 491 | the matches occur. OPTIONALLY, these types of data MAY be further
|
---|
| 492 | compressed using a secondary compressor. Thus, Vcdiff separates
|
---|
| 493 | the encoding of the delta instructions into three parts:
|
---|
| 494 |
|
---|
| 495 | a. The unmatched data in the ADD and RUN instructions,
|
---|
| 496 | b. The delta instructions and accompanying sizes, and
|
---|
| 497 | c. The addresses of the COPY instructions.
|
---|
| 498 |
|
---|
| 499 | If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
|
---|
| 500 | sections may have been compressed using the specified secondary
|
---|
| 501 | compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
|
---|
| 502 | and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
|
---|
| 503 | the corresponding parts are compressed. Then, these parts MUST
|
---|
| 504 | be decompressed before decoding the delta instructions.
|
---|
| 505 |
|
---|
| 506 | Length of data for ADDs and RUNs:
|
---|
| 507 | This is the length (in bytes) of the section of data storing
|
---|
| 508 | the unmatched data accompanying the ADD and RUN instructions.
|
---|
| 509 |
|
---|
| 510 | Length of instructions section:
|
---|
| 511 | This is the length (in bytes) of the delta instructions and
|
---|
| 512 | accompanying sizes.
|
---|
| 513 |
|
---|
| 514 | Length of addresses for COPYs:
|
---|
| 515 | This is the length (in bytes) of the section storing
|
---|
| 516 | the addresses of the COPY instructions.
|
---|
| 517 |
|
---|
| 518 | Data section for ADDs and RUNs:
|
---|
| 519 | This sequence of bytes encodes the unmatched data for the ADD
|
---|
| 520 | and RUN instructions.
|
---|
| 521 |
|
---|
| 522 | Instructions and sizes section:
|
---|
| 523 | This sequence of bytes encodes the instructions and their sizes.
|
---|
| 524 |
|
---|
| 525 | Addresses section for COPYs:
|
---|
| 526 | This sequence of bytes encodes the addresses of the COPY
|
---|
| 527 | instructions.
|
---|
| 528 |
|
---|
| 529 |
|
---|
| 530 | 5. DELTA INSTRUCTION ENCODING
|
---|
| 531 |
|
---|
| 532 | The delta instructions described in Section 3 represent the results of
|
---|
| 533 | string matching. For many data differencing applications in which the
|
---|
| 534 | changes between source and target data are small, any straightforward
|
---|
| 535 | representation of these instructions would be adequate. However, for
|
---|
| 536 | applications including data compression, it is important to encode
|
---|
| 537 | these instructions well to achieve good compression rates. From our
|
---|
| 538 | experience, the following observations can be made:
|
---|
| 539 |
|
---|
| 540 | a. The addresses in COPY instructions are locations of matches and
|
---|
| 541 | often occur close by or even exactly equal to one another. This is
|
---|
| 542 | because data in local regions are often replicated with minor changes.
|
---|
| 543 | In turn, this means that coding a newly matched address against some
|
---|
| 544 | set of recently matched addresses can be beneficial.
|
---|
| 545 |
|
---|
| 546 | b. The matches are often short in length and separated by small amounts
|
---|
| 547 | of unmatched data. That is, the lengths of COPY and ADD instructions
|
---|
| 548 | are often small. This is particularly true of binary data such as
|
---|
| 549 | executable files or structured data such as HTML or XML. In such cases,
|
---|
| 550 | compression can be improved by combining the encoding of the sizes
|
---|
| 551 | and the instruction types as well as combining the encoding of adjacent
|
---|
| 552 | delta instructions with sufficiently small data sizes.
|
---|
| 553 |
|
---|
| 554 | The below subsections discuss how the Vcdiff data format provides
|
---|
| 555 | mechanisms enabling encoders to use the above observations to improve
|
---|
| 556 | compression rates.
|
---|
| 557 |
|
---|
| 558 |
|
---|
| 559 | 5.1 Address Encoding Modes of COPY Instructions
|
---|
| 560 |
|
---|
| 561 | As mentioned earlier, addresses of COPY instructions often occur close
|
---|
| 562 | to one another or are exactly equal. To take advantage of this phenomenon
|
---|
| 563 | and encode addresses of COPY instructions more efficiently, the Vcdiff
|
---|
| 564 | data format supports the use of two different types of address caches.
|
---|
| 565 | Both the encoder and decoder maintain these caches, so that decoder's
|
---|
| 566 | caches remain synchronized with the encoder's caches.
|
---|
| 567 |
|
---|
| 568 | a. A "near" cache is an array with "s_near" slots, each containing an
|
---|
| 569 | address used for encoding addresses nearby to previously encoded
|
---|
| 570 | addresses (in the positive direction only). The near cache also
|
---|
| 571 | maintains a "next_slot" index to the near cache. New entries to the
|
---|
| 572 | near cache are always inserted in the next_slot index, which maintains
|
---|
| 573 | a circular buffer of the s_near most recent addresses.
|
---|
| 574 |
|
---|
| 575 | b. A "same" cache is an array with "s_same" multiple of 256 slots, each
|
---|
| 576 | containing an address. The same cache maintains a hash table of recent
|
---|
| 577 | addresses used for repeated encoding of the exact same address.
|
---|
| 578 |
|
---|
| 579 |
|
---|
| 580 | By default, the parameters s_near and s_same are respectively set to 4
|
---|
| 581 | and 3. An encoder MAY modify these values, but then it MUST encode the
|
---|
| 582 | new values in the encoding itself, as discussed in Section 7, so that
|
---|
| 583 | the decoder can properly set up its own caches.
|
---|
| 584 |
|
---|
| 585 | At the start of processing a target window, an implementation
|
---|
| 586 | (encoder or decoder) initializes all of the slots in both caches
|
---|
| 587 | to zero. The next_slot pointer of the near cache is set
|
---|
| 588 | to point to slot zero.
|
---|
| 589 |
|
---|
| 590 | Each time a COPY instruction is processed by the encoder or
|
---|
| 591 | decoder, the implementation's caches are updated as follows, where
|
---|
| 592 | "addr" is the address in the COPY instruction.
|
---|
| 593 |
|
---|
| 594 | a. The slot in the near cache referenced by the next_slot
|
---|
| 595 | index is set to addr. The next_slot index is then incremented
|
---|
| 596 | modulo s_near.
|
---|
| 597 |
|
---|
| 598 | b. The slot in the same cache whose index is addr%(s_same*256)
|
---|
| 599 | is set to addr. [We use the C notations of % for modulo and
|
---|
| 600 | * for multiplication.]
|
---|
| 601 |
|
---|
| 602 |
|
---|
| 603 | 5.2 Example code for maintaining caches
|
---|
| 604 |
|
---|
| 605 | To make clear the above description, below are example cache data
|
---|
| 606 | structures and algorithms to initialize and update them:
|
---|
| 607 |
|
---|
| 608 | typedef struct _cache_s
|
---|
| 609 | {
|
---|
| 610 | int* near; /* array of size s_near */
|
---|
| 611 | int s_near;
|
---|
| 612 | int next_slot; /* the circular index for near */
|
---|
| 613 | int* same; /* array of size s_same*256 */
|
---|
| 614 | int s_same;
|
---|
| 615 | } Cache_t;
|
---|
| 616 |
|
---|
| 617 | cache_init(Cache_t* ka)
|
---|
| 618 | {
|
---|
| 619 | int i;
|
---|
| 620 |
|
---|
| 621 | ka->next_slot = 0;
|
---|
| 622 | for(i = 0; i < ka->s_near; ++i)
|
---|
| 623 | ka->near[i] = 0;
|
---|
| 624 |
|
---|
| 625 | for(i = 0; i < ka->s_same*256; ++i)
|
---|
| 626 | ka->same[i] = 0;
|
---|
| 627 | }
|
---|
| 628 |
|
---|
| 629 | cache_update(Cache_t* ka, int addr)
|
---|
| 630 | {
|
---|
| 631 | if(ka->s_near > 0)
|
---|
| 632 | { ka->near[ka->next_slot] = addr;
|
---|
| 633 | ka->next_slot = (ka->next_slot + 1) % ka->s_near;
|
---|
| 634 | }
|
---|
| 635 |
|
---|
| 636 | if(ka->s_same > 0)
|
---|
| 637 | ka->same[addr % (ka->s_same*256)] = addr;
|
---|
| 638 | }
|
---|
| 639 |
|
---|
| 640 |
|
---|
| 641 | 5.3 Encoding of COPY instruction addresses
|
---|
| 642 |
|
---|
| 643 | The address of a COPY instruction is encoded using different modes
|
---|
| 644 | depending on the type of cached address used, if any.
|
---|
| 645 |
|
---|
| 646 | Let "addr" be the address of a COPY instruction to be decoded and "here"
|
---|
| 647 | be the current location in the target data (i.e., the start of the data
|
---|
| 648 | about to be encoded or decoded). Let near[j] be the jth element in
|
---|
| 649 | the near cache, and same[k] be the kth element in the same cache.
|
---|
| 650 | Below are the possible address modes:
|
---|
| 651 |
|
---|
| 652 | VCD_SELF: This mode has value 0. The address was encoded by itself
|
---|
| 653 | as an integer.
|
---|
| 654 |
|
---|
| 655 | VCD_HERE: This mode has value 1. The address was encoded as
|
---|
| 656 | the integer value "here - addr".
|
---|
| 657 |
|
---|
| 658 | Near modes: The "near modes" are in the range [2,s_near+1]. Let m
|
---|
| 659 | be the mode of the address encoding. The address was encoded
|
---|
| 660 | as the integer value "addr - near[m-2]".
|
---|
| 661 |
|
---|
| 662 | Same modes: The "same modes" are in the range
|
---|
| 663 | [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
|
---|
| 664 | The address was encoded as a single byte b such that
|
---|
| 665 | "addr == same[(m - (s_near+2))*256 + b]".
|
---|
| 666 |
|
---|
| 667 |
|
---|
| 668 | 5.3 Example code for encoding and decoding of COPY instruction addresses
|
---|
| 669 |
|
---|
| 670 | We show example algorithms below to demonstrate use of address modes more
|
---|
| 671 | clearly. The encoder has freedom to choose address modes, the sample
|
---|
| 672 | addr_encode() algorithm merely shows one way of picking the address
|
---|
| 673 | mode. The decoding algorithm addr_decode() will uniquely decode addresses
|
---|
| 674 | regardless of the encoder's algorithm choice.
|
---|
| 675 |
|
---|
| 676 | Note that the address caches are updated immediately after an address is
|
---|
| 677 | encoded or decoded. In this way, the decoder is always synchronized with
|
---|
| 678 | the encoder.
|
---|
| 679 |
|
---|
| 680 | int addr_encode(Cache_t* ka, int addr, int here, int* mode)
|
---|
| 681 | {
|
---|
| 682 | int i, d, bestd, bestm;
|
---|
| 683 |
|
---|
| 684 | /* Attempt to find the address mode that yields the
|
---|
| 685 | * smallest integer value for "d", the encoded address
|
---|
| 686 | * value, thereby minimizing the encoded size of the
|
---|
| 687 | * address. */
|
---|
| 688 |
|
---|
| 689 | bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */
|
---|
| 690 |
|
---|
| 691 | if((d = here-addr) < bestd)
|
---|
| 692 | { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
|
---|
| 693 |
|
---|
| 694 | for(i = 0; i < ka->s_near; ++i)
|
---|
| 695 | if((d = addr - ka->near[i]) >= 0 && d < bestd)
|
---|
| 696 | { bestd = d; bestm = i+2; }
|
---|
| 697 |
|
---|
| 698 | if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
|
---|
| 699 | { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
|
---|
| 700 |
|
---|
| 701 | cache_update(ka,addr);
|
---|
| 702 |
|
---|
| 703 | *mode = bestm; /* this returns the address encoding mode */
|
---|
| 704 | return bestd; /* this returns the encoded address */
|
---|
| 705 | }
|
---|
| 706 |
|
---|
| 707 | Note that the addr_encode() algorithm chooses the best address mode using a
|
---|
| 708 | local optimization, but that may not lead to the best encoding efficiency
|
---|
| 709 | because different modes lead to different instruction encodings, as described below.
|
---|
| 710 |
|
---|
| 711 | The functions addrint() and addrbyte() used in addr_decode() obtain from
|
---|
| 712 | the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
|
---|
| 713 | respectively. These utilities will not be described here. We simply
|
---|
| 714 | recall that an integer is represented as a compact variable-sized string
|
---|
| 715 | of bytes as described in Section 2 (i.e., base 128).
|
---|
| 716 |
|
---|
| 717 | int addr_decode(Cache_t* ka, int here, int mode)
|
---|
| 718 | { int addr, m;
|
---|
| 719 |
|
---|
| 720 | if(mode == VCD_SELF)
|
---|
| 721 | addr = addrint();
|
---|
| 722 | else if(mode == VCD_HERE)
|
---|
| 723 | addr = here - addrint();
|
---|
| 724 | else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
|
---|
| 725 | addr = ka->near[m] + addrint();
|
---|
| 726 | else /* same cache */
|
---|
| 727 | { m = mode - (2 + ka->s_near);
|
---|
| 728 | addr = ka->same[m*256 + addrbyte()];
|
---|
| 729 | }
|
---|
| 730 |
|
---|
| 731 | cache_update(ka, addr);
|
---|
| 732 |
|
---|
| 733 | return addr;
|
---|
| 734 | }
|
---|
| 735 |
|
---|
| 736 |
|
---|
| 737 | 5.4 Instruction Codes
|
---|
| 738 |
|
---|
| 739 | As noted, the data sizes associated with delta instructions are often
|
---|
| 740 | small. Thus, compression efficiency can be improved by combining the sizes
|
---|
| 741 | and instruction types in a single encoding, as well by combining certain
|
---|
| 742 | pairs of adjacent delta instructions. Effective choices of when to perform
|
---|
| 743 | such combinations depend on many factors including the data being processed
|
---|
| 744 | and the string matching algorithm in use. For example, if many COPY
|
---|
| 745 | instructions have the same data sizes, it may be worth to encode these
|
---|
| 746 | instructions more compactly than others.
|
---|
| 747 |
|
---|
| 748 | The Vcdiff data format is designed so that a decoder does not need to be
|
---|
| 749 | aware of the choices made in encoding algorithms. This is achieved with the
|
---|
| 750 | notion of an "instruction code table" containing 256 entries. Each entry
|
---|
| 751 | defines either a single delta instruction or a pair of instructions that
|
---|
| 752 | have been combined. Note that the code table itself only exists in main
|
---|
| 753 | memory, not in the delta file (unless using an application-defined code
|
---|
| 754 | table, described in Section 7). The encoded data simply includes the index
|
---|
| 755 | of each instruction and, since there are only 256 indices, each index
|
---|
| 756 | can be represented as a single byte.
|
---|
| 757 |
|
---|
| 758 | Each instruction code entry contains six fields, each of which
|
---|
| 759 | is a single byte with unsigned value:
|
---|
| 760 |
|
---|
| 761 | +-----------------------------------------------+
|
---|
| 762 | | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
|
---|
| 763 | +-----------------------------------------------+
|
---|
| 764 |
|
---|
| 765 | @@@ could be more compact
|
---|
| 766 |
|
---|
| 767 | Each triple (inst,size,mode) defines a delta instruction. The meanings
|
---|
| 768 | of these fields are as follows:
|
---|
| 769 |
|
---|
| 770 | inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
|
---|
| 771 | RUN (2) or COPY (3) to indicate the instruction types. NOOP means
|
---|
| 772 | that no instruction is specified. In this case, both the corresponding
|
---|
| 773 | size and mode fields will be zero.
|
---|
| 774 |
|
---|
| 775 | size: A "size" field is zero or positive. A value zero means that the
|
---|
| 776 | size associated with the instruction is encoded separately as
|
---|
| 777 | an integer in the "Instructions and sizes section" (Section 6).
|
---|
| 778 | A positive value for "size" defines the actual data size.
|
---|
| 779 | Note that since the size is restricted to a byte, the maximum
|
---|
| 780 | value for any instruction with size implicitly defined in the code
|
---|
| 781 | table is 255.
|
---|
| 782 |
|
---|
| 783 | mode: A "mode" field is significant only when the associated delta
|
---|
| 784 | instruction is a COPY. It defines the mode used to encode the
|
---|
| 785 | associated addresses. For other instructions, this is always zero.
|
---|
| 786 |
|
---|
| 787 |
|
---|
| 788 | 5.5 The Code Table
|
---|
| 789 |
|
---|
| 790 | Following the discussions on address modes and instruction code tables,
|
---|
| 791 | we define a "Code Table" to have the data below:
|
---|
| 792 |
|
---|
| 793 | s_near: the size of the near cache,
|
---|
| 794 | s_same: the size of the same cache,
|
---|
| 795 | i_code: the 256-entry instruction code table.
|
---|
| 796 |
|
---|
| 797 | Vcdiff itself defines a "default code table" in which s_near is 4
|
---|
| 798 | and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
|
---|
| 799 | The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
|
---|
| 800 | are for addresses coded against the near cache. And, modes 6, 7 and 8
|
---|
| 801 | are for addresses coded against the same cache.
|
---|
| 802 |
|
---|
| 803 | The default instruction code table is depicted below, in a compact
|
---|
| 804 | representation that we use only for descriptive purposes. See section 7
|
---|
| 805 | for the specification of how an instruction code table is represented
|
---|
| 806 | in the Vcdiff encoding format. In the depiction, a zero value for
|
---|
| 807 | size indicates that the size is separately coded. The mode of non-COPY
|
---|
| 808 | instructions is represented as 0 even though they are not used.
|
---|
| 809 |
|
---|
| 810 |
|
---|
| 811 | TYPE SIZE MODE TYPE SIZE MODE INDEX
|
---|
| 812 | ---------------------------------------------------------------
|
---|
| 813 | 1. RUN 0 0 NOOP 0 0 0
|
---|
| 814 | 2. ADD 0, [1,17] 0 NOOP 0 0 [1,18]
|
---|
| 815 | 3. COPY 0, [4,18] 0 NOOP 0 0 [19,34]
|
---|
| 816 | 4. COPY 0, [4,18] 1 NOOP 0 0 [35,50]
|
---|
| 817 | 5. COPY 0, [4,18] 2 NOOP 0 0 [51,66]
|
---|
| 818 | 6. COPY 0, [4,18] 3 NOOP 0 0 [67,82]
|
---|
| 819 | 7. COPY 0, [4,18] 4 NOOP 0 0 [83,98]
|
---|
| 820 | 8. COPY 0, [4,18] 5 NOOP 0 0 [99,114]
|
---|
| 821 | 9. COPY 0, [4,18] 6 NOOP 0 0 [115,130]
|
---|
| 822 | 10. COPY 0, [4,18] 7 NOOP 0 0 [131,146]
|
---|
| 823 | 11. COPY 0, [4,18] 8 NOOP 0 0 [147,162]
|
---|
| 824 | 12. ADD [1,4] 0 COPY [4,6] 0 [163,174]
|
---|
| 825 | 13. ADD [1,4] 0 COPY [4,6] 1 [175,186]
|
---|
| 826 | 14. ADD [1,4] 0 COPY [4,6] 2 [187,198]
|
---|
| 827 | 15. ADD [1,4] 0 COPY [4,6] 3 [199,210]
|
---|
| 828 | 16. ADD [1,4] 0 COPY [4,6] 4 [211,222]
|
---|
| 829 | 17. ADD [1,4] 0 COPY [4,6] 5 [223,234]
|
---|
| 830 | 18. ADD [1,4] 0 COPY 4 6 [235,238]
|
---|
| 831 | 19. ADD [1,4] 0 COPY 4 7 [239,242]
|
---|
| 832 | 20. ADD [1,4] 0 COPY 4 8 [243,246]
|
---|
| 833 | 21. COPY 4 [0,8] ADD 1 0 [247,255]
|
---|
| 834 | ---------------------------------------------------------------
|
---|
| 835 |
|
---|
| 836 | In the above depiction, each numbered line represents one or more
|
---|
| 837 | entries in the actual instruction code table (recall that an entry in
|
---|
| 838 | the instruction code table may represent up to two combined delta
|
---|
| 839 | instructions.) The last column ("INDEX") shows which index value or
|
---|
| 840 | range of index values of the entries covered by that line. The notation
|
---|
| 841 | [i,j] means values from i through j, inclusive. The first 6 columns of
|
---|
| 842 | a line in the depiction describe the pairs of instructions used for
|
---|
| 843 | the corresponding index value(s).
|
---|
| 844 |
|
---|
| 845 | If a line in the depiction includes a column entry using the [i,j]
|
---|
| 846 | notation, this means that the line is instantiated for each value
|
---|
| 847 | in the range from i to j, inclusive. The notation "0, [i,j]" means
|
---|
| 848 | that the line is instantiated for the value 0 and for each value
|
---|
| 849 | in the range from i to j, inclusive.
|
---|
| 850 |
|
---|
| 851 | If a line in the depiction includes more than one entry using the [i,j]
|
---|
| 852 | notation, implying a "nested loop" to convert the line to a range of
|
---|
| 853 | table entries, the first such [i,j] range specifies the outer loop,
|
---|
| 854 | and the second specifies the inner loop.
|
---|
| 855 |
|
---|
| 856 | The below examples should make clear the above description:
|
---|
| 857 |
|
---|
| 858 | Line 1 shows the single RUN instruction with index 0. As the size field
|
---|
| 859 | is 0, this RUN instruction always has its actual size encoded separately.
|
---|
| 860 |
|
---|
| 861 | Line 2 shows the 18 single ADD instructions. The ADD instruction with
|
---|
| 862 | size field 0 (i.e., the actual size is coded separately) has index 1.
|
---|
| 863 | ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
|
---|
| 864 | their sizes are as given (so they will not be separately encoded.)
|
---|
| 865 |
|
---|
| 866 | Following the single ADD instructions are the single COPY instructions
|
---|
| 867 | ordered by their address encoding modes. For example, line 11 shows the
|
---|
| 868 | COPY instructions with mode 8, i.e., the last of the same cache.
|
---|
| 869 | In this case, the COPY instruction with size field 0 has index 147.
|
---|
| 870 | Again, the actual size of this instruction will be coded separately.
|
---|
| 871 |
|
---|
| 872 | Lines 12 to 21 show the pairs of instructions that are combined together.
|
---|
| 873 | For example, line 12 depicts the 12 entries in which an ADD instruction
|
---|
| 874 | is combined with an immediately following COPY instruction. The entries
|
---|
| 875 | with indices 163, 164, 165 represent the pairs in which the ADD
|
---|
| 876 | instructions all have size 1 while the COPY instructions has mode
|
---|
| 877 | 0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
|
---|
| 878 |
|
---|
| 879 | The last line, line 21, shows the eight instruction pairs where the first
|
---|
| 880 | instruction is a COPY and the second is an ADD. In this case, all COPY
|
---|
| 881 | instructions have size 4 with mode ranging from 0 to 8 and all the ADD
|
---|
| 882 | instructions have size 1. Thus, the entry with largest index 255
|
---|
| 883 | combines a COPY instruction of size 4 and mode 8 with an ADD instruction
|
---|
| 884 | of size 1.
|
---|
| 885 |
|
---|
| 886 | The choice of the minimum size 4 for COPY instructions in the default code
|
---|
| 887 | table was made from experiments that showed that excluding small matches
|
---|
| 888 | (less then 4 bytes long) improved the compression rates.
|
---|
| 889 |
|
---|
| 890 |
|
---|
| 891 | 6. DECODING A TARGET WINDOW
|
---|
| 892 |
|
---|
| 893 | Section 4.3 discusses that the delta instructions and associated data
|
---|
| 894 | are encoded in three arrays of bytes:
|
---|
| 895 |
|
---|
| 896 | Data section for ADDs and RUNs,
|
---|
| 897 | Instructions and sizes section, and
|
---|
| 898 | Addresses section for COPYs.
|
---|
| 899 |
|
---|
| 900 |
|
---|
| 901 | Further, these data sections may have been further compressed by some
|
---|
| 902 | secondary compressor. Assuming that any such compressed data has been
|
---|
| 903 | decompressed so that we now have three arrays:
|
---|
| 904 |
|
---|
| 905 | inst: bytes coding the instructions and sizes.
|
---|
| 906 | data: unmatched data associated with ADDs and RUNs.
|
---|
| 907 | addr: bytes coding the addresses of COPYs.
|
---|
| 908 |
|
---|
| 909 | These arrays are organized as follows:
|
---|
| 910 |
|
---|
| 911 | inst:
|
---|
| 912 | a sequence of (index, [size1], [size2]) tuples, where "index"
|
---|
| 913 | is an index into the instruction code table, and size1 and size2
|
---|
| 914 | are integers that MAY or MAY NOT be included in the tuple as
|
---|
| 915 | follows. The entry with the given "index" in the instruction
|
---|
| 916 | code table potentially defines two delta instructions. If the
|
---|
| 917 | first delta instruction is not a VCD_NOOP and its size is zero,
|
---|
| 918 | then size1 MUST be present. Otherwise, size1 MUST be omitted and
|
---|
| 919 | the size of the instruction (if it is not VCD_NOOP) is as defined
|
---|
| 920 | in the table. The presence or absence of size2 is defined
|
---|
| 921 | similarly with respect to the second delta instruction.
|
---|
| 922 |
|
---|
| 923 | data:
|
---|
| 924 | a sequence of data values, encoded as bytes.
|
---|
| 925 |
|
---|
| 926 | addr:
|
---|
| 927 | a sequence of address values. Addresses are normally encoded as
|
---|
| 928 | integers as described in Section 2 (i.e., base 128).
|
---|
| 929 | Since the same cache emits addresses in the range [0,255],
|
---|
| 930 | however, same cache addresses are always encoded as a
|
---|
| 931 | single byte.
|
---|
| 932 |
|
---|
| 933 | To summarize, each tuple in the "inst" array includes an index to some
|
---|
| 934 | entry in the instruction code table that determines:
|
---|
| 935 |
|
---|
| 936 | a. Whether one or two instructions were encoded and their types.
|
---|
| 937 |
|
---|
| 938 | b. If the instructions have their sizes encoded separately, these
|
---|
| 939 | sizes will follow, in order, in the tuple.
|
---|
| 940 |
|
---|
| 941 | c. If the instructions have accompanying data, i.e., ADDs or RUNs,
|
---|
| 942 | their data will be in the array "data".
|
---|
| 943 |
|
---|
| 944 | d. Similarly, if the instructions are COPYs, the coded addresses are
|
---|
| 945 | found in the array "addr".
|
---|
| 946 |
|
---|
| 947 | The decoding procedure simply processes the arrays by reading one code
|
---|
| 948 | index at a time, looking up the corresponding instruction code entry,
|
---|
| 949 | then consuming the respective sizes, data and addresses following the
|
---|
| 950 | directions in this entry. In other words, the decoder maintains an implicit
|
---|
| 951 | next-element pointer for each array; "consuming" an instruction tuple,
|
---|
| 952 | data, or address value implies incrementing the associated pointer.
|
---|
| 953 |
|
---|
| 954 | For example, if during the processing of the target window, the next
|
---|
| 955 | unconsumed tuple in the inst array has index value 19, then the first
|
---|
| 956 | instruction is a COPY, whose size is found as the immediately following
|
---|
| 957 | integer in the inst array. Since the mode of this COPY instruction is
|
---|
| 958 | VCD_SELF, the corresponding address is found by consuming the next
|
---|
| 959 | integer in the addr array. The data array is left intact. As the second
|
---|
| 960 | instruction for code index 19 is a NOOP, this tuple is finished.
|
---|
| 961 |
|
---|
| 962 |
|
---|
| 963 | 7. APPLICATION-DEFINED CODE TABLES
|
---|
| 964 |
|
---|
| 965 | Although the default code table used in Vcdiff is good for general
|
---|
| 966 | purpose encoders, there are times when other code tables may perform
|
---|
| 967 | better. For example, to code a file with many identical segments of data,
|
---|
| 968 | it may be advantageous to have a COPY instruction with the specific size
|
---|
| 969 | of these data segments so that the instruction can be encoded in a single
|
---|
| 970 | byte. Such a special code table MUST then be encoded in the delta file
|
---|
| 971 | so that the decoder can reconstruct it before decoding the data.
|
---|
| 972 |
|
---|
| 973 | Vcdiff allows an application-defined code table to be specified
|
---|
| 974 | in a delta file with the following data:
|
---|
| 975 |
|
---|
| 976 | Size of near cache - byte
|
---|
| 977 | Size of same cache - byte
|
---|
| 978 | Compressed code table data
|
---|
| 979 |
|
---|
| 980 | The "compressed code table data" encodes the delta between the default
|
---|
| 981 | code table (source) and the new code table (target) in the same manner as
|
---|
| 982 | described in Section 4.3 for encoding a target window in terms of a
|
---|
| 983 | source window. This delta is computed using the following steps:
|
---|
| 984 |
|
---|
| 985 | a. Convert the new instruction code table into a string, "code", of
|
---|
| 986 | 1536 bytes using the below steps in order:
|
---|
| 987 |
|
---|
| 988 | i. Add in order the 256 bytes representing the types of the first
|
---|
| 989 | instructions in the instruction pairs.
|
---|
| 990 | ii. Add in order the 256 bytes representing the types of the second
|
---|
| 991 | instructions in the instruction pairs.
|
---|
| 992 | iii. Add in order the 256 bytes representing the sizes of the first
|
---|
| 993 | instructions in the instruction pairs.
|
---|
| 994 | iv. Add in order the 256 bytes representing the sizes of the second
|
---|
| 995 | instructions in the instruction pairs.
|
---|
| 996 | v. Add in order the 256 bytes representing the modes of the first
|
---|
| 997 | instructions in the instruction pairs.
|
---|
| 998 | vi. Add in order the 256 bytes representing the modes of the second
|
---|
| 999 | instructions in the instruction pairs.
|
---|
| 1000 |
|
---|
| 1001 | b. Similarly, convert the default instruction code table into
|
---|
| 1002 | a string "dflt".
|
---|
| 1003 |
|
---|
| 1004 | c. Treat the string "code" as a target window and "dflt" as the
|
---|
| 1005 | corresponding source data and apply an encoding algorithm to
|
---|
| 1006 | compute the delta encoding of "code" in terms of "dflt".
|
---|
| 1007 | This computation MUST use the default code table for encoding
|
---|
| 1008 | the delta instructions.
|
---|
| 1009 |
|
---|
| 1010 | The decoder can then reverse the above steps to decode the compressed
|
---|
| 1011 | table data using the method of Section 6, employing the default code
|
---|
| 1012 | table, to generate the new code table. Note that the decoder does not
|
---|
| 1013 | need to know anything about the details of the encoding algorithm used
|
---|
| 1014 | in step (c). The decoder is still able to decode the new code table
|
---|
| 1015 | because the Vcdiff format is independent from the choice of encoding
|
---|
| 1016 | algorithm, and because the encoder in step (c) uses the known, default
|
---|
| 1017 | code table.
|
---|
| 1018 |
|
---|
| 1019 |
|
---|
| 1020 | 8. PERFORMANCE
|
---|
| 1021 |
|
---|
| 1022 | The encoding format is compact. For compression only, using the LZ-77
|
---|
| 1023 | string parsing strategy and without any secondary compressors, the typical
|
---|
| 1024 | compression rate is better than Unix compress and close to gzip. For
|
---|
| 1025 | differencing, the data format is better than all known methods in
|
---|
| 1026 | terms of its stated goal, which is primarily decoding speed and
|
---|
| 1027 | encoding efficiency.
|
---|
| 1028 |
|
---|
| 1029 | We compare the performance of compress, gzip and Vcdiff using the
|
---|
| 1030 | archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
|
---|
| 1031 | gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
|
---|
| 1032 | SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
|
---|
| 1033 | Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
|
---|
| 1034 | As string and window matching typically dominates the computation during
|
---|
| 1035 | compression, the Vcdiff compression times were directly due to the
|
---|
| 1036 | algorithms used in the Vcodex/Vcdiff software. However, the decompression
|
---|
| 1037 | times should be generic and representative of any good implementation
|
---|
| 1038 | of the Vcdiff data format. Timing was done by running each program
|
---|
| 1039 | three times and taking the average of the total cpu+system times.
|
---|
| 1040 |
|
---|
| 1041 | Below are the different Vcdiff runs:
|
---|
| 1042 |
|
---|
| 1043 | Vcdiff: vcdiff is used as compressor only.
|
---|
| 1044 |
|
---|
| 1045 | Vcdiff-d: vcdiff is used as a differencer only. That is, it only
|
---|
| 1046 | compares target data against source data. Since the files
|
---|
| 1047 | involved are large, they are broken into windows. In this
|
---|
| 1048 | case, each target window starting at some file offset in
|
---|
| 1049 | the target file is compared against a source window with
|
---|
| 1050 | the same file offset (in the source file). The source
|
---|
| 1051 | window is also slightly larger than the target window
|
---|
| 1052 | to increase matching opportunities. The -d option also gives
|
---|
| 1053 | a hint to the string matching algorithm of Vcdiff that
|
---|
| 1054 | the two files are very similar with long stretches of matches.
|
---|
| 1055 | The algorithm takes advantage of this to minimize its
|
---|
| 1056 | processing of source data and save time.
|
---|
| 1057 |
|
---|
| 1058 | Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
|
---|
| 1059 | target data against target data as applicable. Thus, vcdiff
|
---|
| 1060 | both computes differences and compresses data. The windowing
|
---|
| 1061 | algorithm is the same as above. However, the above hint is
|
---|
| 1062 | recinded in this case.
|
---|
| 1063 |
|
---|
| 1064 | Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
|
---|
| 1065 | uses a content-based heuristic to select source data segments
|
---|
| 1066 | that are more likely to match with a given target window.
|
---|
| 1067 | Thus, the source data segment selected for a target window
|
---|
| 1068 | often will not be aligned with the file offsets of this
|
---|
| 1069 | target window.
|
---|
| 1070 |
|
---|
| 1071 |
|
---|
| 1072 | gcc-2.95.1 gcc-2.95.2 compression decompression
|
---|
| 1073 | raw size 55746560 55797760
|
---|
| 1074 | compress - 19939390 13.85s 7.09s
|
---|
| 1075 | gzip - 12973443 42.99s 5.35s
|
---|
| 1076 | Vcdiff - 15358786 20.04s 4.65s
|
---|
| 1077 | Vcdiff-d - 100971 10.93s 1.92s
|
---|
| 1078 | Vcdiff-dc - 97246 20.03s 1.84s
|
---|
| 1079 | Vcdiff-dcs - 256445 44.81s 1.84s
|
---|
| 1080 |
|
---|
| 1081 | TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
|
---|
| 1082 |
|
---|
| 1083 |
|
---|
| 1084 | TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
|
---|
| 1085 | sizes of the compressed results. As a pure compressor, the compression
|
---|
| 1086 | rate for Vcdiff is worse than gzip and better than compress. The last
|
---|
| 1087 | three rows shows that when two file versions are very similar, differencing
|
---|
| 1088 | can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
|
---|
| 1089 | the same simple window selection method but Vcdiff-dc also does compression
|
---|
| 1090 | so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
|
---|
| 1091 | data content to search for source data that likely will match a given target
|
---|
| 1092 | window. Although it does a good job, the heuristic did not always find the
|
---|
| 1093 | best matches which are given by the simple algorithm of Vcdiff-d. As a
|
---|
| 1094 | result, the output size is slightly larger. Note also that there is a large
|
---|
| 1095 | cost in computing matching windows this way. Finally, the compression times
|
---|
| 1096 | of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
|
---|
| 1097 | that the compression feature causes the additional time in Vcdiff-dc
|
---|
| 1098 | relative to Vcdiff-d. However, this is not the case. The hint given to
|
---|
| 1099 | the Vcdiff string matching algorithm that the two files are likely to
|
---|
| 1100 | have very long stretches of matches helps the algorithm to minimize
|
---|
| 1101 | processing of the "source data", thus saving half the time. However, as we
|
---|
| 1102 | shall see below when this hint is wrong, the result is even longer time.
|
---|
| 1103 |
|
---|
| 1104 |
|
---|
| 1105 | gcc-2.95.2 gcc-2.95.3 compression decompression
|
---|
| 1106 | raw size 55797760 55787520
|
---|
| 1107 | compress - 19939453 13.54s 7.00s
|
---|
| 1108 | gzip - 12998097 42.63s 5.62s
|
---|
| 1109 | Vcdiff - 15371737 20.09s 4.74s
|
---|
| 1110 | Vcdiff-d - 26383849 71.41s 6.41s
|
---|
| 1111 | Vcdiff-dc - 14461203 42.48s 4.82s
|
---|
| 1112 | Vcdiff-dcs - 1248543 61.18s 1.99s
|
---|
| 1113 |
|
---|
| 1114 | TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
|
---|
| 1115 |
|
---|
| 1116 |
|
---|
| 1117 | TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
|
---|
| 1118 | the sizes of the compressed results. In this case, the tar file of
|
---|
| 1119 | gcc-2.95.3 is rearranged in a way that makes the straightforward method
|
---|
| 1120 | of matching file offsets for source and target windows fail. As a
|
---|
| 1121 | result, Vcdiff-d performs rather dismally both in time and output size.
|
---|
| 1122 | The large time for Vcdiff-d is directly due to fact that the string
|
---|
| 1123 | matching algorithm has to work much harder to find matches when the hint
|
---|
| 1124 | that two files have long matching stretches fails to hold. On the other
|
---|
| 1125 | hand, Vcdiff-dc does both differencing and compression resulting in good
|
---|
| 1126 | output size. Finally, the window searching heuristic used in Vcdiff-dcs is
|
---|
| 1127 | effective in finding the right matching source windows for target windows
|
---|
| 1128 | resulting a small output size. This shows why the data format needs to
|
---|
| 1129 | have a way to specify matching windows to gain performance. Finally,
|
---|
| 1130 | we note that the decoding times are always good regardless of how
|
---|
| 1131 | the string matching or window searching algorithms perform.
|
---|
| 1132 |
|
---|
| 1133 |
|
---|
| 1134 | 9. FURTHER ISSUES
|
---|
| 1135 |
|
---|
| 1136 | This document does not address a few issues:
|
---|
| 1137 |
|
---|
| 1138 | Secondary compressors:
|
---|
| 1139 | As discussed in Section 4.3, certain sections in the delta encoding
|
---|
| 1140 | of a window may be further compressed by a secondary compressor.
|
---|
| 1141 | In our experience, the basic Vcdiff format is adequate for most
|
---|
| 1142 | purposes so that secondary compressors are seldom needed. In
|
---|
| 1143 | particular, for normal use of data differencing where the files to
|
---|
| 1144 | be compared have long stretches of matches, much of the gain in
|
---|
| 1145 | compression rate is already achieved by normal string matching.
|
---|
| 1146 | Thus, the use of secondary compressors is seldom needed in this case.
|
---|
| 1147 | However, for applications beyond differencing of such nearly identical
|
---|
| 1148 | files, secondary compressors may be needed to achieve maximal
|
---|
| 1149 | compressed results.
|
---|
| 1150 |
|
---|
| 1151 | Therefore, we recommend to leave the Vcdiff data format defined
|
---|
| 1152 | as in this document so that the use of secondary compressors
|
---|
| 1153 | can be implemented when they become needed in the future.
|
---|
| 1154 | The formats of the compressed data via such compressors or any
|
---|
| 1155 | compressors that may be defined in the future are left open to
|
---|
| 1156 | their implementations. These could include Huffman encoding,
|
---|
| 1157 | arithmetic encoding, and splay tree encoding [8,9].
|
---|
| 1158 |
|
---|
| 1159 | Large file system vs. small file system:
|
---|
| 1160 | As discussed in Section 4, a target window in a large file may be
|
---|
| 1161 | compared against some source window in another file or in the same
|
---|
| 1162 | file (from some earlier part). In that case, the file offset of the
|
---|
| 1163 | source window is specified as a variable-sized integer in the delta
|
---|
| 1164 | encoding. There is a possibility that the encoding was computed on
|
---|
| 1165 | a system supporting much larger files than in a system where
|
---|
| 1166 | the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
|
---|
| 1167 | systems). In that case, some target data may not be recoverable.
|
---|
| 1168 | This problem could afflict any compression format, and ought
|
---|
| 1169 | to be resolved with a generic negotiation mechanism in the
|
---|
| 1170 | appropriate protocol(s).
|
---|
| 1171 |
|
---|
| 1172 |
|
---|
| 1173 | 10. SUMMARY
|
---|
| 1174 |
|
---|
| 1175 | We have described Vcdiff, a general and portable encoding format for
|
---|
| 1176 | compression and differencing. The format is good in that it allows
|
---|
| 1177 | implementing a decoder without knowledge of the encoders. Further,
|
---|
| 1178 | ignoring the use of secondary compressors not defined within the format,
|
---|
| 1179 | the decoding algorithms runs in linear time and requires working space
|
---|
| 1180 | proportional to window sizes.
|
---|
| 1181 |
|
---|
| 1182 |
|
---|
| 1183 |
|
---|
| 1184 | 11. ACKNOWLEDGEMENTS
|
---|
| 1185 |
|
---|
| 1186 | Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
|
---|
| 1187 | who provided much encouragement to publicize Vcdiff. In particular, Jeff
|
---|
| 1188 | helped clarifying the description of the data format presented here.
|
---|
| 1189 |
|
---|
| 1190 |
|
---|
| 1191 |
|
---|
| 1192 | 12. SECURITY CONSIDERATIONS
|
---|
| 1193 |
|
---|
| 1194 | Vcdiff only provides a format to encode compressed and differenced data.
|
---|
| 1195 | It does not address any issues concerning how such data are, in fact,
|
---|
| 1196 | stored in a given file system or the run-time memory of a computer system.
|
---|
| 1197 | Therefore, we do not anticipate any security issues with respect to Vcdiff.
|
---|
| 1198 |
|
---|
| 1199 |
|
---|
| 1200 |
|
---|
| 1201 | 13. SOURCE CODE AVAILABILITY
|
---|
| 1202 |
|
---|
| 1203 | Vcdiff is implemented as a data transforming method in Phong Vo's
|
---|
| 1204 | Vcodex library. AT&T Corp. has made the source code for Vcodex available
|
---|
| 1205 | for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
|
---|
| 1206 | The source code and according license is accessible at the below URL:
|
---|
| 1207 |
|
---|
| 1208 | http://www.research.att.com/sw/tools
|
---|
| 1209 |
|
---|
| 1210 |
|
---|
| 1211 | 14. INTELLECTUAL PROPERTY RIGHTS
|
---|
| 1212 |
|
---|
| 1213 | The IETF has been notified of intellectual property rights claimed in
|
---|
| 1214 | regard to some or all of the specification contained in this
|
---|
| 1215 | document. For more information consult the online list of claimed
|
---|
| 1216 | rights, at <http://www.ietf.org/ipr.html>.
|
---|
| 1217 |
|
---|
| 1218 | The IETF takes no position regarding the validity or scope of any
|
---|
| 1219 | intellectual property or other rights that might be claimed to
|
---|
| 1220 | pertain to the implementation or use of the technology described in
|
---|
| 1221 | this document or the extent to which any license under such rights
|
---|
| 1222 | might or might not be available; neither does it represent that it
|
---|
| 1223 | has made any effort to identify any such rights. Information on the
|
---|
| 1224 | IETF's procedures with respect to rights in standards-track and
|
---|
| 1225 | standards-related documentation can be found in BCP-11. Copies of
|
---|
| 1226 | claims of rights made available for publication and any assurances of
|
---|
| 1227 | licenses to be made available, or the result of an attempt made to
|
---|
| 1228 | obtain a general license or permission for the use of such
|
---|
| 1229 | proprietary rights by implementors or users of this specification can
|
---|
| 1230 | be obtained from the IETF Secretariat.
|
---|
| 1231 |
|
---|
| 1232 |
|
---|
| 1233 |
|
---|
| 1234 | 15. IANA CONSIDERATIONS
|
---|
| 1235 |
|
---|
| 1236 | The Internet Assigned Numbers Authority (IANA) administers the number
|
---|
| 1237 | space for Secondary Compressor ID values. Values and their meaning
|
---|
| 1238 | must be documented in an RFC or other peer-reviewed, permanent, and
|
---|
| 1239 | readily available reference, in sufficient detail so that
|
---|
| 1240 | interoperability between independent implementations is possible.
|
---|
| 1241 | Subject to these constraints, name assignments are First Come, First
|
---|
| 1242 | Served - see RFC2434 [13]. Legal ID values are in the range 1..255.
|
---|
| 1243 |
|
---|
| 1244 | This document does not define any values in this number space.
|
---|
| 1245 |
|
---|
| 1246 |
|
---|
| 1247 | 16. REFERENCES
|
---|
| 1248 |
|
---|
| 1249 | [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
|
---|
| 1250 | Practical Reusable Unix Software, Editor B. Krishnamurthy,
|
---|
| 1251 | John Wiley & Sons, Inc., 1995.
|
---|
| 1252 |
|
---|
| 1253 | [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
|
---|
| 1254 | Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
|
---|
| 1255 |
|
---|
| 1256 | [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
|
---|
| 1257 | ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
|
---|
| 1258 |
|
---|
| 1259 | [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
|
---|
| 1260 | Algorithm, Journal of the ACM, 23:262-272, 1976.
|
---|
| 1261 |
|
---|
| 1262 | [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
|
---|
| 1263 | IEEE Software Configuration and Maintenance Workshop, 1996.
|
---|
| 1264 |
|
---|
| 1265 | [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
|
---|
| 1266 | ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
|
---|
| 1267 |
|
---|
| 1268 | [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
|
---|
| 1269 | Proc. of the Summer '91 Usenix Conference, 1991.
|
---|
| 1270 |
|
---|
| 1271 | [8] D. W. Jones, Application of Splay Trees to Data Compression,
|
---|
| 1272 | CACM, 31(8):996:1007.
|
---|
| 1273 |
|
---|
| 1274 | [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
|
---|
| 1275 | M&T Books, New York, NY, 1995.
|
---|
| 1276 |
|
---|
| 1277 | [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
|
---|
| 1278 | Potential benefits of delta encoding and data compression for HTTP,
|
---|
| 1279 | SIGCOMM '97, Cannes, France, 1997.
|
---|
| 1280 |
|
---|
| 1281 | [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
|
---|
| 1282 | Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
|
---|
| 1283 | IETF, draft-mogul-http-delta-10, 2001.
|
---|
| 1284 |
|
---|
| 1285 | [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
|
---|
| 1286 | RFC 2119, March 1997.
|
---|
| 1287 |
|
---|
| 1288 | [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
|
---|
| 1289 | Considerations Section in RFCs, RFC2434, October 1998.
|
---|
| 1290 |
|
---|
| 1291 |
|
---|
| 1292 |
|
---|
| 1293 | 17. AUTHOR'S ADDRESS
|
---|
| 1294 |
|
---|
| 1295 | Kiem-Phong Vo (main contact)
|
---|
| 1296 | AT&T Labs, Room D223
|
---|
| 1297 | 180 Park Avenue
|
---|
| 1298 | Florham Park, NJ 07932
|
---|
| 1299 | Email: kpv@research.att.com
|
---|
| 1300 | Phone: 1 973 360 8630
|
---|
| 1301 |
|
---|
| 1302 | David G. Korn
|
---|
| 1303 | AT&T Labs, Room D237
|
---|
| 1304 | 180 Park Avenue
|
---|
| 1305 | Florham Park, NJ 07932
|
---|
| 1306 | Email: dgk@research.att.com
|
---|
| 1307 | Phone: 1 973 360 8602
|
---|
| 1308 |
|
---|
| 1309 | Jeffrey C. Mogul
|
---|
| 1310 | Western Research Laboratory
|
---|
| 1311 | Compaq Computer Corporation
|
---|
| 1312 | 250 University Avenue
|
---|
| 1313 | Palo Alto, California, 94305, U.S.A.
|
---|
| 1314 | Email: JeffMogul@acm.org
|
---|
| 1315 | Phone: 1 650 617 3304 (email preferred)
|
---|
| 1316 |
|
---|
| 1317 | Joshua P. MacDonald
|
---|
| 1318 | Computer Science Division
|
---|
| 1319 | University of California, Berkeley
|
---|
| 1320 | 345 Soda Hall
|
---|
| 1321 | Berkeley, CA 94720
|
---|
| 1322 | Email: jmacd@cs.berkeley.edu
|
---|