1 | David G. Korn, AT&T Labs
|
---|
2 | Joshua P. MacDonald, UC Berkeley
|
---|
3 | Jeffrey C. Mogul, Compaq WRL
|
---|
4 | Internet-Draft Kiem-Phong Vo, AT&T Labs
|
---|
5 | Expires: 09 November 2002 09 November 2001
|
---|
6 |
|
---|
7 |
|
---|
8 | The VCDIFF Generic Differencing and Compression Data Format
|
---|
9 |
|
---|
10 | draft-korn-vcdiff-06.txt
|
---|
11 |
|
---|
12 |
|
---|
13 |
|
---|
14 | Status of this Memo
|
---|
15 |
|
---|
16 | This document is an Internet-Draft and is in full conformance
|
---|
17 | with all provisions of Section 10 of RFC2026.
|
---|
18 |
|
---|
19 | Internet-Drafts are working documents of the Internet Engineering
|
---|
20 | Task Force (IETF), its areas, and its working groups. Note that
|
---|
21 | other groups may also distribute working documents as
|
---|
22 | Internet-Drafts.
|
---|
23 |
|
---|
24 | Internet-Drafts are draft documents valid for a maximum of six
|
---|
25 | months and may be updated, replaced, or obsoleted by other
|
---|
26 | documents at any time. It is inappropriate to use Internet-
|
---|
27 | Drafts as reference material or to cite them other than as
|
---|
28 | "work in progress."
|
---|
29 |
|
---|
30 | The list of current Internet-Drafts can be accessed at
|
---|
31 | http://www.ietf.org/ietf/1id-abstracts.txt
|
---|
32 |
|
---|
33 | The list of Internet-Draft Shadow Directories can be accessed at
|
---|
34 | http://www.ietf.org/shadow.html.
|
---|
35 |
|
---|
36 |
|
---|
37 | Abstract
|
---|
38 |
|
---|
39 | This memo describes a general, efficient and portable data format
|
---|
40 | suitable for encoding compressed and/or differencing data so that
|
---|
41 | they can be easily transported among computers.
|
---|
42 |
|
---|
43 |
|
---|
44 | Table of Contents:
|
---|
45 |
|
---|
46 | 1. EXECUTIVE SUMMARY ............................................ 2
|
---|
47 | 2. CONVENTIONS .................................................. 3
|
---|
48 | 3. DELTA INSTRUCTIONS ........................................... 4
|
---|
49 | 4. DELTA FILE ORGANIZATION ...................................... 5
|
---|
50 | 5. DELTA INSTRUCTION ENCODING ................................... 9
|
---|
51 | 6. DECODING A TARGET WINDOW ..................................... 14
|
---|
52 | 7. APPLICATION-DEFINED CODE TABLES .............................. 16
|
---|
53 | 8. PERFORMANCE .................................................. 16
|
---|
54 | 9. FURTHER ISSUES ............................................... 17
|
---|
55 | 10. SUMMARY ...................................................... 18
|
---|
56 | 11. ACKNOWLEDGEMENTS ............................................. 18
|
---|
57 | 12. SECURITY CONSIDERATIONS ...................................... 18
|
---|
58 | 13. SOURCE CODE AVAILABILITY ..................................... 18
|
---|
59 | 14. INTELLECTUAL PROPERTY RIGHTS ................................. 18
|
---|
60 | 15. IANA CONSIDERATIONS .......................................... 19
|
---|
61 | 16. REFERENCES ................................................... 19
|
---|
62 | 17. AUTHOR'S ADDRESS ............................................. 20
|
---|
63 |
|
---|
64 |
|
---|
65 | 1. EXECUTIVE SUMMARY
|
---|
66 |
|
---|
67 | Compression and differencing techniques can greatly improve storage
|
---|
68 | and transmission of files and file versions. Since files are often
|
---|
69 | transported across machines with distinct architectures and performance
|
---|
70 | characteristics, such data should be encoded in a form that is portable
|
---|
71 | and can be decoded with little or no knowledge of the encoders.
|
---|
72 | This document describes Vcdiff, a compact portable encoding format
|
---|
73 | designed for these purposes.
|
---|
74 |
|
---|
75 | Data differencing is the process of computing a compact and invertible
|
---|
76 | encoding of a "target file" given a "source file". Data compression
|
---|
77 | is similar but without the use of source data. The UNIX utilities diff,
|
---|
78 | compress, and gzip are well-known examples of data differencing and
|
---|
79 | compression tools. For data differencing, the computed encoding is
|
---|
80 | called a "delta file", and, for data compression, it is called
|
---|
81 | a "compressed file". Delta and compressed files are good for storage
|
---|
82 | and transmission as they are often smaller than the originals.
|
---|
83 |
|
---|
84 | Data differencing and data compression are traditionally treated
|
---|
85 | as distinct types of data processing. However, as shown in the Vdelta
|
---|
86 | technique by Korn and Vo [1], compression can be thought of as a special
|
---|
87 | case of differencing in which the source data is empty. The basic idea
|
---|
88 | is to unify the string parsing scheme used in the Lempel-Ziv'77 style
|
---|
89 | compressors [2], and the block-move technique of Tichy [3]. Loosely
|
---|
90 | speaking, this works as follows:
|
---|
91 |
|
---|
92 | a. Concatenate source and target data.
|
---|
93 | b. Parse the data from left to right as in LZ'77 but
|
---|
94 | make sure that a parsed segment starts the target data.
|
---|
95 | c. Start to output when reaching target data.
|
---|
96 |
|
---|
97 | Parsing is based on string matching algorithms such as suffix trees [4]
|
---|
98 | or hashing with different time and space performance characteristics.
|
---|
99 | Vdelta uses a fast string matching algorithm that requires less memory
|
---|
100 | than other techniques [5,6]. However, even with this algorithm, the
|
---|
101 | memory requirement can still be prohibitive for large files. A common
|
---|
102 | way to deal with memory limitation is to partition an input file into
|
---|
103 | chunks called "windows" and process them separately. Here, except for
|
---|
104 | unpublished work by Vo, little has been done on designing effective
|
---|
105 | windowing schemes. Current techniques, including Vdelta, simply use
|
---|
106 | source and target windows with corresponding addresses across source
|
---|
107 | and target files.
|
---|
108 |
|
---|
109 | String matching and windowing algorithms have large influence on the
|
---|
110 | compression rate of delta and compressed files. However, it is desirable
|
---|
111 | to have a portable encoding format that is independent of such algorithms.
|
---|
112 | This enables construction of client-server applications in which a server
|
---|
113 | may serve clients with unknown computing characteristics. Unfortunately,
|
---|
114 | all current differencing and compressing tools, including Vdelta, fall
|
---|
115 | short in this respect. Their storage formats are closely intertwined
|
---|
116 | with the implemented string matching and/or windowing algorithms.
|
---|
117 |
|
---|
118 | The encoding format Vcdiff proposed here addresses the above issues.
|
---|
119 | Vcdiff achieves the below characteristics:
|
---|
120 |
|
---|
121 | Output compactness:
|
---|
122 | The basic encoding format compactly represents compressed or delta
|
---|
123 | files. Applications can further extend the basic encoding format
|
---|
124 | with "secondary encoders" to achieve more compression.
|
---|
125 |
|
---|
126 | Data portability:
|
---|
127 | The basic encoding format is free from machine byte order and
|
---|
128 | word size issues. This allows data to be encoded on one machine
|
---|
129 | and decoded on a different machine with different architecture.
|
---|
130 |
|
---|
131 | Algorithm genericity:
|
---|
132 | The decoding algorithm is independent from string matching and
|
---|
133 | windowing algorithms. This allows competition among implementations
|
---|
134 | of the encoder while keeping the same decoder.
|
---|
135 |
|
---|
136 | Decoding efficiency:
|
---|
137 | Except for secondary encoder issues, the decoding algorithm runs
|
---|
138 | in time proportional to the size of the target file and uses space
|
---|
139 | proportional to the maximal window size. Vcdiff differs from more
|
---|
140 | conventional compressors in that it uses only byte-aligned
|
---|
141 | data, thus avoiding bit-level operations, which improves
|
---|
142 | decoding speed at the slight cost of compression efficiency.
|
---|
143 |
|
---|
144 | The Vcdiff data format and the algorithms for decoding data shall be
|
---|
145 | described next. Since Vcdiff treats compression as a special case of
|
---|
146 | differencing, we shall use the term "delta file" to indicate the
|
---|
147 | compressed output for both cases.
|
---|
148 |
|
---|
149 |
|
---|
150 | 2. CONVENTIONS
|
---|
151 |
|
---|
152 | The basic data unit is a byte. For portability, Vcdiff shall limit
|
---|
153 | a byte to its lower eight bits even on machines with larger bytes.
|
---|
154 | The bits in a byte are ordered from right to left so that the least
|
---|
155 | significant bit (LSB) has value 1, and the most significant bit (MSB),
|
---|
156 | has value 128.
|
---|
157 |
|
---|
158 | For purposes of exposition in this document, we adopt the convention
|
---|
159 | that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers
|
---|
160 | never appear in the encoded format itself.
|
---|
161 |
|
---|
162 | Vcdiff encodes unsigned integer values using a portable variable-sized
|
---|
163 | format (originally introduced in the Sfio library [7]). This encoding
|
---|
164 | treats an integer as a number in base 128. Then, each digit in this
|
---|
165 | representation is encoded in the lower seven bits of a byte. Except for
|
---|
166 | the least significant byte, other bytes have their most significant bit
|
---|
167 | turned on to indicate that there are still more digits in the encoding.
|
---|
168 | The two key properties of this integer encoding that are beneficial
|
---|
169 | to a data compression format are:
|
---|
170 |
|
---|
171 | a. The encoding is portable among systems using 8-bit bytes, and
|
---|
172 | b. Small values are encoded compactly.
|
---|
173 |
|
---|
174 | For example, consider the value 123456789 which can be represented with
|
---|
175 | four 7-bit digits whose values are 58, 111, 26, 21 in order from most
|
---|
176 | to least significant. Below is the 8-bit byte encoding of these digits.
|
---|
177 | Note that the MSBs of 58, 111 and 26 are on.
|
---|
178 |
|
---|
179 | +-------------------------------------------+
|
---|
180 | | 10111010 | 11101111 | 10011010 | 00010101 |
|
---|
181 | +-------------------------------------------+
|
---|
182 | MSB+58 MSB+111 MSB+26 0+21
|
---|
183 |
|
---|
184 |
|
---|
185 | Henceforth, the terms "byte" and "integer" will refer to a byte and an
|
---|
186 | unsigned integer as described.
|
---|
187 |
|
---|
188 |
|
---|
189 | From time to time, algorithms are exhibited to clarify the descriptions
|
---|
190 | of parts of the Vcdiff format. On such occasions, the C language will be
|
---|
191 | used to make precise the algorithms. The C code shown in this
|
---|
192 | document is meant for clarification only, and is not part of the
|
---|
193 | actual specification of the Vcdiff format.
|
---|
194 |
|
---|
195 | In this specification, the key words "MUST", "MUST NOT",
|
---|
196 | "SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
|
---|
197 | described in RFC2119 [12].
|
---|
198 |
|
---|
199 |
|
---|
200 | 3. DELTA INSTRUCTIONS
|
---|
201 |
|
---|
202 | A large target file is partitioned into non-overlapping sections
|
---|
203 | called "target windows". These target windows are processed separately
|
---|
204 | and sequentially based on their order in the target file.
|
---|
205 |
|
---|
206 | A target window T of length t may be compared against some source data
|
---|
207 | segment S of length s. By construction, this source data segment S
|
---|
208 | comes either from the source file, if one is used, or from a part of
|
---|
209 | the target file earlier than T. In this way, during decoding, S is
|
---|
210 | completely known when T is being decoded.
|
---|
211 |
|
---|
212 | The choices of T, t, S and s are made by some window selection algorithm
|
---|
213 | which can greatly affect the size of the encoding. However, as seen later,
|
---|
214 | these choices are encoded so that no knowledge of the window selection
|
---|
215 | algorithm is needed during decoding.
|
---|
216 |
|
---|
217 | Assume that S[j] represents the jth byte in S, and T[k] represents
|
---|
218 | the kth byte in T. Then, for the delta instructions, we treat the data
|
---|
219 | windows S and T as substrings of a superstring U formed by concatenating
|
---|
220 | them like this:
|
---|
221 |
|
---|
222 | S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
|
---|
223 |
|
---|
224 | The "address" of a byte in S or T is referred to by its location in U.
|
---|
225 | For example, the address of T[k] is s+k.
|
---|
226 |
|
---|
227 | The instructions to encode and direct the reconstruction of a target
|
---|
228 | window are called delta instructions. There are three types:
|
---|
229 |
|
---|
230 | ADD: This instruction has two arguments, a size x and a sequence of
|
---|
231 | x bytes to be copied.
|
---|
232 | COPY: This instruction has two arguments, a size x and an address p
|
---|
233 | in the string U. The arguments specify the substring of U that
|
---|
234 | must be copied. We shall assert that such a substring must be
|
---|
235 | entirely contained in either S or T.
|
---|
236 | RUN: This instruction has two arguments, a size x and a byte b that
|
---|
237 | will be repeated x times.
|
---|
238 |
|
---|
239 | Below are example source and target windows and the delta instructions
|
---|
240 | that encode the target window in terms of the source window.
|
---|
241 |
|
---|
242 | a b c d e f g h i j k l m n o p
|
---|
243 | a b c d w x y z e f g h e f g h e f g h e f g h z z z z
|
---|
244 |
|
---|
245 | COPY 4, 0
|
---|
246 | ADD 4, w x y z
|
---|
247 | COPY 4, 4
|
---|
248 | COPY 12, 24
|
---|
249 | RUN 4, z
|
---|
250 |
|
---|
251 |
|
---|
252 | Thus, the first letter 'a' in the target window is at location 16
|
---|
253 | in the superstring. Note that the fourth instruction, "COPY 12, 24",
|
---|
254 | copies data from T itself since address 24 is position 8 in T.
|
---|
255 | This instruction also shows that it is fine to overlap the data to be
|
---|
256 | copied with the data being copied from as long as the latter starts
|
---|
257 | earlier. This enables efficient encoding of periodic sequences,
|
---|
258 | i.e., sequences with regularly repeated subsequences. The RUN instruction
|
---|
259 | is a compact way to encode a sequence repeating the same byte even though
|
---|
260 | such a sequence can be thought of as a periodic sequence with period 1.
|
---|
261 |
|
---|
262 | To reconstruct the target window, one simply processes one delta
|
---|
263 | instruction at a time and copy the data either from the source window
|
---|
264 | or the being reconstructed target window based on the type of the
|
---|
265 | instruction and the associated address, if any.
|
---|
266 |
|
---|
267 |
|
---|
268 | 4. DELTA FILE ORGANIZATION
|
---|
269 |
|
---|
270 | A Vcdiff delta file starts with a Header section followed by a sequence
|
---|
271 | of Window sections. The Header section includes magic bytes to identify
|
---|
272 | the file type, and information concerning data processing beyond the
|
---|
273 | basic encoding format. The Window sections encode the target windows.
|
---|
274 |
|
---|
275 | Below is the overall organization of a delta file. The indented items
|
---|
276 | refine the ones immediately above them. An item in square brackets may
|
---|
277 | or may not be present in the file depending on the information encoded
|
---|
278 | in the Indicator byte above it.
|
---|
279 |
|
---|
280 | Header
|
---|
281 | Header1 - byte
|
---|
282 | Header2 - byte
|
---|
283 | Header3 - byte
|
---|
284 | Header4 - byte
|
---|
285 | Hdr_Indicator - byte
|
---|
286 | [Secondary compressor ID] - byte
|
---|
287 |
|
---|
288 | [@@@ Why is compressor ID not an integer? ]
|
---|
289 | [@@@ If we aren't defining any secondary compressors yet, then it seems
|
---|
290 | that defining the [Secondary compressor ID] and the corresponding
|
---|
291 | VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. An
|
---|
292 | implementation of this specification won't be able to decode a VCDIFF
|
---|
293 | encoded with this option if it doesn't know about any secondary
|
---|
294 | compressors. It seems that you should specify the bits related to
|
---|
295 | secondary compressors once you have defined the first a secondary
|
---|
296 | compressor. I can imagine a secondary-compressor might want to supply
|
---|
297 | extra information, such as a dictionary of some kind, in which case
|
---|
298 | this speculative treatment wouldn't go far enough.]
|
---|
299 |
|
---|
300 | [Length of code table data] - integer
|
---|
301 | [Code table data]
|
---|
302 | Size of near cache - byte
|
---|
303 | Size of same cache - byte
|
---|
304 | Compressed code table data
|
---|
305 | Window1
|
---|
306 | Win_Indicator - byte
|
---|
307 | [Source segment size] - integer
|
---|
308 | [Source segment position] - integer
|
---|
309 | The delta encoding of the target window
|
---|
310 | Length of the delta encoding - integer
|
---|
311 | The delta encoding
|
---|
312 | Size of the target window - integer
|
---|
313 | Delta_Indicator - byte
|
---|
314 | Length of data for ADDs and RUNs - integer
|
---|
315 | Length of instructions and sizes - integer
|
---|
316 | Length of addresses for COPYs - integer
|
---|
317 | Data section for ADDs and RUNs - array of bytes
|
---|
318 | Instructions and sizes section - array of bytes
|
---|
319 | Addresses section for COPYs - array of bytes
|
---|
320 | Window2
|
---|
321 | ...
|
---|
322 |
|
---|
323 |
|
---|
324 |
|
---|
325 | 4.1 The Header Section
|
---|
326 |
|
---|
327 | Each delta file starts with a header section organized as below.
|
---|
328 | Note the convention that square-brackets enclose optional items.
|
---|
329 |
|
---|
330 | Header1 - byte = 0xE6
|
---|
331 | Header2 - byte = 0xD3
|
---|
332 | Header3 - byte = 0xD4
|
---|
333 |
|
---|
334 | HMMM
|
---|
335 |
|
---|
336 | 0xD6
|
---|
337 | 0xC3
|
---|
338 | 0xC4
|
---|
339 |
|
---|
340 | Header4 - byte
|
---|
341 | Hdr_Indicator - byte
|
---|
342 | [Secondary compressor ID] - byte
|
---|
343 | [Length of code table data] - integer
|
---|
344 | [Code table data]
|
---|
345 |
|
---|
346 | The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
|
---|
347 | with their most significant bits turned on (in hexadecimal, the values
|
---|
348 | are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
|
---|
349 | zero. In the future, it might be used to indicate the version of Vcdiff.
|
---|
350 |
|
---|
351 | The Hdr_Indicator byte shows if there are any initialization data
|
---|
352 | required to aid in the reconstruction of data in the Window sections.
|
---|
353 | This byte MAY have non-zero values for either, both, or neither of
|
---|
354 | the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
|
---|
355 |
|
---|
356 | 7 6 5 4 3 2 1 0
|
---|
357 | +-+-+-+-+-+-+-+-+
|
---|
358 | | | | | | | | | |
|
---|
359 | +-+-+-+-+-+-+-+-+
|
---|
360 | ^ ^
|
---|
361 | | |
|
---|
362 | | +-- VCD_DECOMPRESS
|
---|
363 | +---- VCD_CODETABLE
|
---|
364 |
|
---|
365 | If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
|
---|
366 | compressor may have been used to further compress certain parts of the
|
---|
367 | delta encoding data as described in Sections 4.3 and 6. In that case,
|
---|
368 | the ID of the secondary compressor is given next. If this bit is zero,
|
---|
369 | the compressor ID byte is not included.
|
---|
370 |
|
---|
371 | [@@@ If we aren't defining any secondary compressors yet, then it seems
|
---|
372 | this bit has no real value yet..]
|
---|
373 |
|
---|
374 | If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
|
---|
375 | application-defined code table is to be used for decoding the delta
|
---|
376 | instructions. This table itself is compressed. The length of the data
|
---|
377 | comprising this compressed code table and the data follow next. Section 7
|
---|
378 | discusses application-defined code tables. If this bit is zero, the code
|
---|
379 | table data length and the code table data are not included.
|
---|
380 |
|
---|
381 | If both bits are set, then the compressor ID byte is included
|
---|
382 | before the code table data length and the code table data.
|
---|
383 |
|
---|
384 |
|
---|
385 | 4.2 The Format of a Window Section
|
---|
386 |
|
---|
387 | Each Window section is organized as follows:
|
---|
388 |
|
---|
389 | Win_Indicator - byte
|
---|
390 | [Source segment length] - integer
|
---|
391 | [Source segment position] - integer
|
---|
392 | The delta encoding of the target window
|
---|
393 |
|
---|
394 |
|
---|
395 | Below are the detail of the various items:
|
---|
396 |
|
---|
397 | [@@@ Here, I want to replace the Win_Indicator with a source-count,
|
---|
398 | followed by source-count length/position pairs?]
|
---|
399 |
|
---|
400 | Win_Indicator:
|
---|
401 | This byte is a set of bits, as shown:
|
---|
402 |
|
---|
403 | 7 6 5 4 3 2 1 0
|
---|
404 | +-+-+-+-+-+-+-+-+
|
---|
405 | | | | | | | | | |
|
---|
406 | +-+-+-+-+-+-+-+-+
|
---|
407 | ^ ^
|
---|
408 | | |
|
---|
409 | | +-- VCD_SOURCE
|
---|
410 | +---- VCD_TARGET
|
---|
411 |
|
---|
412 |
|
---|
413 | If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
|
---|
414 | of data from the "source" file was used as the corresponding
|
---|
415 | source window of data to encode the target window. The decoder
|
---|
416 | will use this same source data segment to decode the target window.
|
---|
417 |
|
---|
418 | If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
|
---|
419 | of data from the "target" file was used as the corresponding
|
---|
420 | source window of data to encode the target window. As above, this
|
---|
421 | same source data segment is used to decode the target window.
|
---|
422 |
|
---|
423 | The Win_Indicator byte MUST NOT have more than one of the bits
|
---|
424 | set (non-zero). It MAY have none of these bits set.
|
---|
425 |
|
---|
426 | If one of these bits is set, the byte is followed by two
|
---|
427 | integers to indicate respectively the length and position of
|
---|
428 | the source data segment in the relevant file. If the
|
---|
429 | indicator byte is zero, the target window was compressed
|
---|
430 | by itself without comparing against another data segment,
|
---|
431 | and these two integers are not included.
|
---|
432 |
|
---|
433 | The delta encoding of the target window:
|
---|
434 | This contains the delta encoding of the target window either
|
---|
435 | in terms of the source data segment (i.e., VCD_SOURCE
|
---|
436 | or VCD_TARGET was set) or by itself if no source window
|
---|
437 | is specified. This data format is discussed next.
|
---|
438 |
|
---|
439 |
|
---|
440 | 4.3 The Delta Encoding of a Target Window
|
---|
441 |
|
---|
442 | The delta encoding of a target window is organized as follows:
|
---|
443 |
|
---|
444 | Length of the delta encoding - integer
|
---|
445 | The delta encoding
|
---|
446 | Length of the target window - integer
|
---|
447 | Delta_Indicator - byte
|
---|
448 | Length of data for ADDs and RUNs - integer
|
---|
449 | Length of instructions section - integer
|
---|
450 | Length of addresses for COPYs - integer
|
---|
451 | Data section for ADDs and RUNs - array of bytes
|
---|
452 | Instructions and sizes section - array of bytes
|
---|
453 | Addresses section for COPYs - array of bytes
|
---|
454 |
|
---|
455 |
|
---|
456 | Length of the delta encoding:
|
---|
457 | This integer gives the total number of remaining bytes that
|
---|
458 | comprise data of the delta encoding for this target window.
|
---|
459 |
|
---|
460 | The delta encoding:
|
---|
461 | This contains the data representing the delta encoding which
|
---|
462 | is described next.
|
---|
463 |
|
---|
464 | Length of the target window:
|
---|
465 | This integer indicates the actual size of the target window
|
---|
466 | after decompression. A decoder can use this value to allocate
|
---|
467 | memory to store the uncompressed data.
|
---|
468 |
|
---|
469 | Delta_Indicator:
|
---|
470 | This byte is a set of bits, as shown:
|
---|
471 |
|
---|
472 | 7 6 5 4 3 2 1 0
|
---|
473 | +-+-+-+-+-+-+-+-+
|
---|
474 | | | | | | | | | |
|
---|
475 | +-+-+-+-+-+-+-+-+
|
---|
476 | ^ ^ ^
|
---|
477 | | | |
|
---|
478 | | | +-- VCD_DATACOMP
|
---|
479 | | +---- VCD_INSTCOMP
|
---|
480 | +------ VCD_ADDRCOMP
|
---|
481 |
|
---|
482 | VCD_DATACOMP: bit value 1.
|
---|
483 | VCD_INSTCOMP: bit value 2.
|
---|
484 | VCD_ADDRCOMP: bit value 4.
|
---|
485 |
|
---|
486 | As discussed, the delta encoding consists of COPY, ADD and RUN
|
---|
487 | instructions. The ADD and RUN instructions have accompanying
|
---|
488 | unmatched data (that is, data that does not specifically match
|
---|
489 | any data in the source window or in some earlier part of the
|
---|
490 | target window) and the COPY instructions have addresses of where
|
---|
491 | the matches occur. OPTIONALLY, these types of data MAY be further
|
---|
492 | compressed using a secondary compressor. Thus, Vcdiff separates
|
---|
493 | the encoding of the delta instructions into three parts:
|
---|
494 |
|
---|
495 | a. The unmatched data in the ADD and RUN instructions,
|
---|
496 | b. The delta instructions and accompanying sizes, and
|
---|
497 | c. The addresses of the COPY instructions.
|
---|
498 |
|
---|
499 | If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
|
---|
500 | sections may have been compressed using the specified secondary
|
---|
501 | compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
|
---|
502 | and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
|
---|
503 | the corresponding parts are compressed. Then, these parts MUST
|
---|
504 | be decompressed before decoding the delta instructions.
|
---|
505 |
|
---|
506 | Length of data for ADDs and RUNs:
|
---|
507 | This is the length (in bytes) of the section of data storing
|
---|
508 | the unmatched data accompanying the ADD and RUN instructions.
|
---|
509 |
|
---|
510 | Length of instructions section:
|
---|
511 | This is the length (in bytes) of the delta instructions and
|
---|
512 | accompanying sizes.
|
---|
513 |
|
---|
514 | Length of addresses for COPYs:
|
---|
515 | This is the length (in bytes) of the section storing
|
---|
516 | the addresses of the COPY instructions.
|
---|
517 |
|
---|
518 | Data section for ADDs and RUNs:
|
---|
519 | This sequence of bytes encodes the unmatched data for the ADD
|
---|
520 | and RUN instructions.
|
---|
521 |
|
---|
522 | Instructions and sizes section:
|
---|
523 | This sequence of bytes encodes the instructions and their sizes.
|
---|
524 |
|
---|
525 | Addresses section for COPYs:
|
---|
526 | This sequence of bytes encodes the addresses of the COPY
|
---|
527 | instructions.
|
---|
528 |
|
---|
529 |
|
---|
530 | 5. DELTA INSTRUCTION ENCODING
|
---|
531 |
|
---|
532 | The delta instructions described in Section 3 represent the results of
|
---|
533 | string matching. For many data differencing applications in which the
|
---|
534 | changes between source and target data are small, any straightforward
|
---|
535 | representation of these instructions would be adequate. However, for
|
---|
536 | applications including data compression, it is important to encode
|
---|
537 | these instructions well to achieve good compression rates. From our
|
---|
538 | experience, the following observations can be made:
|
---|
539 |
|
---|
540 | a. The addresses in COPY instructions are locations of matches and
|
---|
541 | often occur close by or even exactly equal to one another. This is
|
---|
542 | because data in local regions are often replicated with minor changes.
|
---|
543 | In turn, this means that coding a newly matched address against some
|
---|
544 | set of recently matched addresses can be beneficial.
|
---|
545 |
|
---|
546 | b. The matches are often short in length and separated by small amounts
|
---|
547 | of unmatched data. That is, the lengths of COPY and ADD instructions
|
---|
548 | are often small. This is particularly true of binary data such as
|
---|
549 | executable files or structured data such as HTML or XML. In such cases,
|
---|
550 | compression can be improved by combining the encoding of the sizes
|
---|
551 | and the instruction types as well as combining the encoding of adjacent
|
---|
552 | delta instructions with sufficiently small data sizes.
|
---|
553 |
|
---|
554 | The below subsections discuss how the Vcdiff data format provides
|
---|
555 | mechanisms enabling encoders to use the above observations to improve
|
---|
556 | compression rates.
|
---|
557 |
|
---|
558 |
|
---|
559 | 5.1 Address Encoding Modes of COPY Instructions
|
---|
560 |
|
---|
561 | As mentioned earlier, addresses of COPY instructions often occur close
|
---|
562 | to one another or are exactly equal. To take advantage of this phenomenon
|
---|
563 | and encode addresses of COPY instructions more efficiently, the Vcdiff
|
---|
564 | data format supports the use of two different types of address caches.
|
---|
565 | Both the encoder and decoder maintain these caches, so that decoder's
|
---|
566 | caches remain synchronized with the encoder's caches.
|
---|
567 |
|
---|
568 | a. A "near" cache is an array with "s_near" slots, each containing an
|
---|
569 | address used for encoding addresses nearby to previously encoded
|
---|
570 | addresses (in the positive direction only). The near cache also
|
---|
571 | maintains a "next_slot" index to the near cache. New entries to the
|
---|
572 | near cache are always inserted in the next_slot index, which maintains
|
---|
573 | a circular buffer of the s_near most recent addresses.
|
---|
574 |
|
---|
575 | b. A "same" cache is an array with "s_same" multiple of 256 slots, each
|
---|
576 | containing an address. The same cache maintains a hash table of recent
|
---|
577 | addresses used for repeated encoding of the exact same address.
|
---|
578 |
|
---|
579 |
|
---|
580 | By default, the parameters s_near and s_same are respectively set to 4
|
---|
581 | and 3. An encoder MAY modify these values, but then it MUST encode the
|
---|
582 | new values in the encoding itself, as discussed in Section 7, so that
|
---|
583 | the decoder can properly set up its own caches.
|
---|
584 |
|
---|
585 | At the start of processing a target window, an implementation
|
---|
586 | (encoder or decoder) initializes all of the slots in both caches
|
---|
587 | to zero. The next_slot pointer of the near cache is set
|
---|
588 | to point to slot zero.
|
---|
589 |
|
---|
590 | Each time a COPY instruction is processed by the encoder or
|
---|
591 | decoder, the implementation's caches are updated as follows, where
|
---|
592 | "addr" is the address in the COPY instruction.
|
---|
593 |
|
---|
594 | a. The slot in the near cache referenced by the next_slot
|
---|
595 | index is set to addr. The next_slot index is then incremented
|
---|
596 | modulo s_near.
|
---|
597 |
|
---|
598 | b. The slot in the same cache whose index is addr%(s_same*256)
|
---|
599 | is set to addr. [We use the C notations of % for modulo and
|
---|
600 | * for multiplication.]
|
---|
601 |
|
---|
602 |
|
---|
603 | 5.2 Example code for maintaining caches
|
---|
604 |
|
---|
605 | To make clear the above description, below are example cache data
|
---|
606 | structures and algorithms to initialize and update them:
|
---|
607 |
|
---|
608 | typedef struct _cache_s
|
---|
609 | {
|
---|
610 | int* near; /* array of size s_near */
|
---|
611 | int s_near;
|
---|
612 | int next_slot; /* the circular index for near */
|
---|
613 | int* same; /* array of size s_same*256 */
|
---|
614 | int s_same;
|
---|
615 | } Cache_t;
|
---|
616 |
|
---|
617 | cache_init(Cache_t* ka)
|
---|
618 | {
|
---|
619 | int i;
|
---|
620 |
|
---|
621 | ka->next_slot = 0;
|
---|
622 | for(i = 0; i < ka->s_near; ++i)
|
---|
623 | ka->near[i] = 0;
|
---|
624 |
|
---|
625 | for(i = 0; i < ka->s_same*256; ++i)
|
---|
626 | ka->same[i] = 0;
|
---|
627 | }
|
---|
628 |
|
---|
629 | cache_update(Cache_t* ka, int addr)
|
---|
630 | {
|
---|
631 | if(ka->s_near > 0)
|
---|
632 | { ka->near[ka->next_slot] = addr;
|
---|
633 | ka->next_slot = (ka->next_slot + 1) % ka->s_near;
|
---|
634 | }
|
---|
635 |
|
---|
636 | if(ka->s_same > 0)
|
---|
637 | ka->same[addr % (ka->s_same*256)] = addr;
|
---|
638 | }
|
---|
639 |
|
---|
640 |
|
---|
641 | 5.3 Encoding of COPY instruction addresses
|
---|
642 |
|
---|
643 | The address of a COPY instruction is encoded using different modes
|
---|
644 | depending on the type of cached address used, if any.
|
---|
645 |
|
---|
646 | Let "addr" be the address of a COPY instruction to be decoded and "here"
|
---|
647 | be the current location in the target data (i.e., the start of the data
|
---|
648 | about to be encoded or decoded). Let near[j] be the jth element in
|
---|
649 | the near cache, and same[k] be the kth element in the same cache.
|
---|
650 | Below are the possible address modes:
|
---|
651 |
|
---|
652 | VCD_SELF: This mode has value 0. The address was encoded by itself
|
---|
653 | as an integer.
|
---|
654 |
|
---|
655 | VCD_HERE: This mode has value 1. The address was encoded as
|
---|
656 | the integer value "here - addr".
|
---|
657 |
|
---|
658 | Near modes: The "near modes" are in the range [2,s_near+1]. Let m
|
---|
659 | be the mode of the address encoding. The address was encoded
|
---|
660 | as the integer value "addr - near[m-2]".
|
---|
661 |
|
---|
662 | Same modes: The "same modes" are in the range
|
---|
663 | [s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
|
---|
664 | The address was encoded as a single byte b such that
|
---|
665 | "addr == same[(m - (s_near+2))*256 + b]".
|
---|
666 |
|
---|
667 |
|
---|
668 | 5.3 Example code for encoding and decoding of COPY instruction addresses
|
---|
669 |
|
---|
670 | We show example algorithms below to demonstrate use of address modes more
|
---|
671 | clearly. The encoder has freedom to choose address modes, the sample
|
---|
672 | addr_encode() algorithm merely shows one way of picking the address
|
---|
673 | mode. The decoding algorithm addr_decode() will uniquely decode addresses
|
---|
674 | regardless of the encoder's algorithm choice.
|
---|
675 |
|
---|
676 | Note that the address caches are updated immediately after an address is
|
---|
677 | encoded or decoded. In this way, the decoder is always synchronized with
|
---|
678 | the encoder.
|
---|
679 |
|
---|
680 | int addr_encode(Cache_t* ka, int addr, int here, int* mode)
|
---|
681 | {
|
---|
682 | int i, d, bestd, bestm;
|
---|
683 |
|
---|
684 | /* Attempt to find the address mode that yields the
|
---|
685 | * smallest integer value for "d", the encoded address
|
---|
686 | * value, thereby minimizing the encoded size of the
|
---|
687 | * address. */
|
---|
688 |
|
---|
689 | bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */
|
---|
690 |
|
---|
691 | if((d = here-addr) < bestd)
|
---|
692 | { bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
|
---|
693 |
|
---|
694 | for(i = 0; i < ka->s_near; ++i)
|
---|
695 | if((d = addr - ka->near[i]) >= 0 && d < bestd)
|
---|
696 | { bestd = d; bestm = i+2; }
|
---|
697 |
|
---|
698 | if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
|
---|
699 | { bestd = d%256; bestm = ka->s_near + 2 + d/256; }
|
---|
700 |
|
---|
701 | cache_update(ka,addr);
|
---|
702 |
|
---|
703 | *mode = bestm; /* this returns the address encoding mode */
|
---|
704 | return bestd; /* this returns the encoded address */
|
---|
705 | }
|
---|
706 |
|
---|
707 | Note that the addr_encode() algorithm chooses the best address mode using a
|
---|
708 | local optimization, but that may not lead to the best encoding efficiency
|
---|
709 | because different modes lead to different instruction encodings, as described below.
|
---|
710 |
|
---|
711 | The functions addrint() and addrbyte() used in addr_decode() obtain from
|
---|
712 | the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
|
---|
713 | respectively. These utilities will not be described here. We simply
|
---|
714 | recall that an integer is represented as a compact variable-sized string
|
---|
715 | of bytes as described in Section 2 (i.e., base 128).
|
---|
716 |
|
---|
717 | int addr_decode(Cache_t* ka, int here, int mode)
|
---|
718 | { int addr, m;
|
---|
719 |
|
---|
720 | if(mode == VCD_SELF)
|
---|
721 | addr = addrint();
|
---|
722 | else if(mode == VCD_HERE)
|
---|
723 | addr = here - addrint();
|
---|
724 | else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
|
---|
725 | addr = ka->near[m] + addrint();
|
---|
726 | else /* same cache */
|
---|
727 | { m = mode - (2 + ka->s_near);
|
---|
728 | addr = ka->same[m*256 + addrbyte()];
|
---|
729 | }
|
---|
730 |
|
---|
731 | cache_update(ka, addr);
|
---|
732 |
|
---|
733 | return addr;
|
---|
734 | }
|
---|
735 |
|
---|
736 |
|
---|
737 | 5.4 Instruction Codes
|
---|
738 |
|
---|
739 | As noted, the data sizes associated with delta instructions are often
|
---|
740 | small. Thus, compression efficiency can be improved by combining the sizes
|
---|
741 | and instruction types in a single encoding, as well by combining certain
|
---|
742 | pairs of adjacent delta instructions. Effective choices of when to perform
|
---|
743 | such combinations depend on many factors including the data being processed
|
---|
744 | and the string matching algorithm in use. For example, if many COPY
|
---|
745 | instructions have the same data sizes, it may be worth to encode these
|
---|
746 | instructions more compactly than others.
|
---|
747 |
|
---|
748 | The Vcdiff data format is designed so that a decoder does not need to be
|
---|
749 | aware of the choices made in encoding algorithms. This is achieved with the
|
---|
750 | notion of an "instruction code table" containing 256 entries. Each entry
|
---|
751 | defines either a single delta instruction or a pair of instructions that
|
---|
752 | have been combined. Note that the code table itself only exists in main
|
---|
753 | memory, not in the delta file (unless using an application-defined code
|
---|
754 | table, described in Section 7). The encoded data simply includes the index
|
---|
755 | of each instruction and, since there are only 256 indices, each index
|
---|
756 | can be represented as a single byte.
|
---|
757 |
|
---|
758 | Each instruction code entry contains six fields, each of which
|
---|
759 | is a single byte with unsigned value:
|
---|
760 |
|
---|
761 | +-----------------------------------------------+
|
---|
762 | | inst1 | size1 | mode1 | inst2 | size2 | mode2 |
|
---|
763 | +-----------------------------------------------+
|
---|
764 |
|
---|
765 | @@@ could be more compact
|
---|
766 |
|
---|
767 | Each triple (inst,size,mode) defines a delta instruction. The meanings
|
---|
768 | of these fields are as follows:
|
---|
769 |
|
---|
770 | inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
|
---|
771 | RUN (2) or COPY (3) to indicate the instruction types. NOOP means
|
---|
772 | that no instruction is specified. In this case, both the corresponding
|
---|
773 | size and mode fields will be zero.
|
---|
774 |
|
---|
775 | size: A "size" field is zero or positive. A value zero means that the
|
---|
776 | size associated with the instruction is encoded separately as
|
---|
777 | an integer in the "Instructions and sizes section" (Section 6).
|
---|
778 | A positive value for "size" defines the actual data size.
|
---|
779 | Note that since the size is restricted to a byte, the maximum
|
---|
780 | value for any instruction with size implicitly defined in the code
|
---|
781 | table is 255.
|
---|
782 |
|
---|
783 | mode: A "mode" field is significant only when the associated delta
|
---|
784 | instruction is a COPY. It defines the mode used to encode the
|
---|
785 | associated addresses. For other instructions, this is always zero.
|
---|
786 |
|
---|
787 |
|
---|
788 | 5.5 The Code Table
|
---|
789 |
|
---|
790 | Following the discussions on address modes and instruction code tables,
|
---|
791 | we define a "Code Table" to have the data below:
|
---|
792 |
|
---|
793 | s_near: the size of the near cache,
|
---|
794 | s_same: the size of the same cache,
|
---|
795 | i_code: the 256-entry instruction code table.
|
---|
796 |
|
---|
797 | Vcdiff itself defines a "default code table" in which s_near is 4
|
---|
798 | and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
|
---|
799 | The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
|
---|
800 | are for addresses coded against the near cache. And, modes 6, 7 and 8
|
---|
801 | are for addresses coded against the same cache.
|
---|
802 |
|
---|
803 | The default instruction code table is depicted below, in a compact
|
---|
804 | representation that we use only for descriptive purposes. See section 7
|
---|
805 | for the specification of how an instruction code table is represented
|
---|
806 | in the Vcdiff encoding format. In the depiction, a zero value for
|
---|
807 | size indicates that the size is separately coded. The mode of non-COPY
|
---|
808 | instructions is represented as 0 even though they are not used.
|
---|
809 |
|
---|
810 |
|
---|
811 | TYPE SIZE MODE TYPE SIZE MODE INDEX
|
---|
812 | ---------------------------------------------------------------
|
---|
813 | 1. RUN 0 0 NOOP 0 0 0
|
---|
814 | 2. ADD 0, [1,17] 0 NOOP 0 0 [1,18]
|
---|
815 | 3. COPY 0, [4,18] 0 NOOP 0 0 [19,34]
|
---|
816 | 4. COPY 0, [4,18] 1 NOOP 0 0 [35,50]
|
---|
817 | 5. COPY 0, [4,18] 2 NOOP 0 0 [51,66]
|
---|
818 | 6. COPY 0, [4,18] 3 NOOP 0 0 [67,82]
|
---|
819 | 7. COPY 0, [4,18] 4 NOOP 0 0 [83,98]
|
---|
820 | 8. COPY 0, [4,18] 5 NOOP 0 0 [99,114]
|
---|
821 | 9. COPY 0, [4,18] 6 NOOP 0 0 [115,130]
|
---|
822 | 10. COPY 0, [4,18] 7 NOOP 0 0 [131,146]
|
---|
823 | 11. COPY 0, [4,18] 8 NOOP 0 0 [147,162]
|
---|
824 | 12. ADD [1,4] 0 COPY [4,6] 0 [163,174]
|
---|
825 | 13. ADD [1,4] 0 COPY [4,6] 1 [175,186]
|
---|
826 | 14. ADD [1,4] 0 COPY [4,6] 2 [187,198]
|
---|
827 | 15. ADD [1,4] 0 COPY [4,6] 3 [199,210]
|
---|
828 | 16. ADD [1,4] 0 COPY [4,6] 4 [211,222]
|
---|
829 | 17. ADD [1,4] 0 COPY [4,6] 5 [223,234]
|
---|
830 | 18. ADD [1,4] 0 COPY 4 6 [235,238]
|
---|
831 | 19. ADD [1,4] 0 COPY 4 7 [239,242]
|
---|
832 | 20. ADD [1,4] 0 COPY 4 8 [243,246]
|
---|
833 | 21. COPY 4 [0,8] ADD 1 0 [247,255]
|
---|
834 | ---------------------------------------------------------------
|
---|
835 |
|
---|
836 | In the above depiction, each numbered line represents one or more
|
---|
837 | entries in the actual instruction code table (recall that an entry in
|
---|
838 | the instruction code table may represent up to two combined delta
|
---|
839 | instructions.) The last column ("INDEX") shows which index value or
|
---|
840 | range of index values of the entries covered by that line. The notation
|
---|
841 | [i,j] means values from i through j, inclusive. The first 6 columns of
|
---|
842 | a line in the depiction describe the pairs of instructions used for
|
---|
843 | the corresponding index value(s).
|
---|
844 |
|
---|
845 | If a line in the depiction includes a column entry using the [i,j]
|
---|
846 | notation, this means that the line is instantiated for each value
|
---|
847 | in the range from i to j, inclusive. The notation "0, [i,j]" means
|
---|
848 | that the line is instantiated for the value 0 and for each value
|
---|
849 | in the range from i to j, inclusive.
|
---|
850 |
|
---|
851 | If a line in the depiction includes more than one entry using the [i,j]
|
---|
852 | notation, implying a "nested loop" to convert the line to a range of
|
---|
853 | table entries, the first such [i,j] range specifies the outer loop,
|
---|
854 | and the second specifies the inner loop.
|
---|
855 |
|
---|
856 | The below examples should make clear the above description:
|
---|
857 |
|
---|
858 | Line 1 shows the single RUN instruction with index 0. As the size field
|
---|
859 | is 0, this RUN instruction always has its actual size encoded separately.
|
---|
860 |
|
---|
861 | Line 2 shows the 18 single ADD instructions. The ADD instruction with
|
---|
862 | size field 0 (i.e., the actual size is coded separately) has index 1.
|
---|
863 | ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
|
---|
864 | their sizes are as given (so they will not be separately encoded.)
|
---|
865 |
|
---|
866 | Following the single ADD instructions are the single COPY instructions
|
---|
867 | ordered by their address encoding modes. For example, line 11 shows the
|
---|
868 | COPY instructions with mode 8, i.e., the last of the same cache.
|
---|
869 | In this case, the COPY instruction with size field 0 has index 147.
|
---|
870 | Again, the actual size of this instruction will be coded separately.
|
---|
871 |
|
---|
872 | Lines 12 to 21 show the pairs of instructions that are combined together.
|
---|
873 | For example, line 12 depicts the 12 entries in which an ADD instruction
|
---|
874 | is combined with an immediately following COPY instruction. The entries
|
---|
875 | with indices 163, 164, 165 represent the pairs in which the ADD
|
---|
876 | instructions all have size 1 while the COPY instructions has mode
|
---|
877 | 0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
|
---|
878 |
|
---|
879 | The last line, line 21, shows the eight instruction pairs where the first
|
---|
880 | instruction is a COPY and the second is an ADD. In this case, all COPY
|
---|
881 | instructions have size 4 with mode ranging from 0 to 8 and all the ADD
|
---|
882 | instructions have size 1. Thus, the entry with largest index 255
|
---|
883 | combines a COPY instruction of size 4 and mode 8 with an ADD instruction
|
---|
884 | of size 1.
|
---|
885 |
|
---|
886 | The choice of the minimum size 4 for COPY instructions in the default code
|
---|
887 | table was made from experiments that showed that excluding small matches
|
---|
888 | (less then 4 bytes long) improved the compression rates.
|
---|
889 |
|
---|
890 |
|
---|
891 | 6. DECODING A TARGET WINDOW
|
---|
892 |
|
---|
893 | Section 4.3 discusses that the delta instructions and associated data
|
---|
894 | are encoded in three arrays of bytes:
|
---|
895 |
|
---|
896 | Data section for ADDs and RUNs,
|
---|
897 | Instructions and sizes section, and
|
---|
898 | Addresses section for COPYs.
|
---|
899 |
|
---|
900 |
|
---|
901 | Further, these data sections may have been further compressed by some
|
---|
902 | secondary compressor. Assuming that any such compressed data has been
|
---|
903 | decompressed so that we now have three arrays:
|
---|
904 |
|
---|
905 | inst: bytes coding the instructions and sizes.
|
---|
906 | data: unmatched data associated with ADDs and RUNs.
|
---|
907 | addr: bytes coding the addresses of COPYs.
|
---|
908 |
|
---|
909 | These arrays are organized as follows:
|
---|
910 |
|
---|
911 | inst:
|
---|
912 | a sequence of (index, [size1], [size2]) tuples, where "index"
|
---|
913 | is an index into the instruction code table, and size1 and size2
|
---|
914 | are integers that MAY or MAY NOT be included in the tuple as
|
---|
915 | follows. The entry with the given "index" in the instruction
|
---|
916 | code table potentially defines two delta instructions. If the
|
---|
917 | first delta instruction is not a VCD_NOOP and its size is zero,
|
---|
918 | then size1 MUST be present. Otherwise, size1 MUST be omitted and
|
---|
919 | the size of the instruction (if it is not VCD_NOOP) is as defined
|
---|
920 | in the table. The presence or absence of size2 is defined
|
---|
921 | similarly with respect to the second delta instruction.
|
---|
922 |
|
---|
923 | data:
|
---|
924 | a sequence of data values, encoded as bytes.
|
---|
925 |
|
---|
926 | addr:
|
---|
927 | a sequence of address values. Addresses are normally encoded as
|
---|
928 | integers as described in Section 2 (i.e., base 128).
|
---|
929 | Since the same cache emits addresses in the range [0,255],
|
---|
930 | however, same cache addresses are always encoded as a
|
---|
931 | single byte.
|
---|
932 |
|
---|
933 | To summarize, each tuple in the "inst" array includes an index to some
|
---|
934 | entry in the instruction code table that determines:
|
---|
935 |
|
---|
936 | a. Whether one or two instructions were encoded and their types.
|
---|
937 |
|
---|
938 | b. If the instructions have their sizes encoded separately, these
|
---|
939 | sizes will follow, in order, in the tuple.
|
---|
940 |
|
---|
941 | c. If the instructions have accompanying data, i.e., ADDs or RUNs,
|
---|
942 | their data will be in the array "data".
|
---|
943 |
|
---|
944 | d. Similarly, if the instructions are COPYs, the coded addresses are
|
---|
945 | found in the array "addr".
|
---|
946 |
|
---|
947 | The decoding procedure simply processes the arrays by reading one code
|
---|
948 | index at a time, looking up the corresponding instruction code entry,
|
---|
949 | then consuming the respective sizes, data and addresses following the
|
---|
950 | directions in this entry. In other words, the decoder maintains an implicit
|
---|
951 | next-element pointer for each array; "consuming" an instruction tuple,
|
---|
952 | data, or address value implies incrementing the associated pointer.
|
---|
953 |
|
---|
954 | For example, if during the processing of the target window, the next
|
---|
955 | unconsumed tuple in the inst array has index value 19, then the first
|
---|
956 | instruction is a COPY, whose size is found as the immediately following
|
---|
957 | integer in the inst array. Since the mode of this COPY instruction is
|
---|
958 | VCD_SELF, the corresponding address is found by consuming the next
|
---|
959 | integer in the addr array. The data array is left intact. As the second
|
---|
960 | instruction for code index 19 is a NOOP, this tuple is finished.
|
---|
961 |
|
---|
962 |
|
---|
963 | 7. APPLICATION-DEFINED CODE TABLES
|
---|
964 |
|
---|
965 | Although the default code table used in Vcdiff is good for general
|
---|
966 | purpose encoders, there are times when other code tables may perform
|
---|
967 | better. For example, to code a file with many identical segments of data,
|
---|
968 | it may be advantageous to have a COPY instruction with the specific size
|
---|
969 | of these data segments so that the instruction can be encoded in a single
|
---|
970 | byte. Such a special code table MUST then be encoded in the delta file
|
---|
971 | so that the decoder can reconstruct it before decoding the data.
|
---|
972 |
|
---|
973 | Vcdiff allows an application-defined code table to be specified
|
---|
974 | in a delta file with the following data:
|
---|
975 |
|
---|
976 | Size of near cache - byte
|
---|
977 | Size of same cache - byte
|
---|
978 | Compressed code table data
|
---|
979 |
|
---|
980 | The "compressed code table data" encodes the delta between the default
|
---|
981 | code table (source) and the new code table (target) in the same manner as
|
---|
982 | described in Section 4.3 for encoding a target window in terms of a
|
---|
983 | source window. This delta is computed using the following steps:
|
---|
984 |
|
---|
985 | a. Convert the new instruction code table into a string, "code", of
|
---|
986 | 1536 bytes using the below steps in order:
|
---|
987 |
|
---|
988 | i. Add in order the 256 bytes representing the types of the first
|
---|
989 | instructions in the instruction pairs.
|
---|
990 | ii. Add in order the 256 bytes representing the types of the second
|
---|
991 | instructions in the instruction pairs.
|
---|
992 | iii. Add in order the 256 bytes representing the sizes of the first
|
---|
993 | instructions in the instruction pairs.
|
---|
994 | iv. Add in order the 256 bytes representing the sizes of the second
|
---|
995 | instructions in the instruction pairs.
|
---|
996 | v. Add in order the 256 bytes representing the modes of the first
|
---|
997 | instructions in the instruction pairs.
|
---|
998 | vi. Add in order the 256 bytes representing the modes of the second
|
---|
999 | instructions in the instruction pairs.
|
---|
1000 |
|
---|
1001 | b. Similarly, convert the default instruction code table into
|
---|
1002 | a string "dflt".
|
---|
1003 |
|
---|
1004 | c. Treat the string "code" as a target window and "dflt" as the
|
---|
1005 | corresponding source data and apply an encoding algorithm to
|
---|
1006 | compute the delta encoding of "code" in terms of "dflt".
|
---|
1007 | This computation MUST use the default code table for encoding
|
---|
1008 | the delta instructions.
|
---|
1009 |
|
---|
1010 | The decoder can then reverse the above steps to decode the compressed
|
---|
1011 | table data using the method of Section 6, employing the default code
|
---|
1012 | table, to generate the new code table. Note that the decoder does not
|
---|
1013 | need to know anything about the details of the encoding algorithm used
|
---|
1014 | in step (c). The decoder is still able to decode the new code table
|
---|
1015 | because the Vcdiff format is independent from the choice of encoding
|
---|
1016 | algorithm, and because the encoder in step (c) uses the known, default
|
---|
1017 | code table.
|
---|
1018 |
|
---|
1019 |
|
---|
1020 | 8. PERFORMANCE
|
---|
1021 |
|
---|
1022 | The encoding format is compact. For compression only, using the LZ-77
|
---|
1023 | string parsing strategy and without any secondary compressors, the typical
|
---|
1024 | compression rate is better than Unix compress and close to gzip. For
|
---|
1025 | differencing, the data format is better than all known methods in
|
---|
1026 | terms of its stated goal, which is primarily decoding speed and
|
---|
1027 | encoding efficiency.
|
---|
1028 |
|
---|
1029 | We compare the performance of compress, gzip and Vcdiff using the
|
---|
1030 | archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
|
---|
1031 | gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
|
---|
1032 | SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
|
---|
1033 | Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
|
---|
1034 | As string and window matching typically dominates the computation during
|
---|
1035 | compression, the Vcdiff compression times were directly due to the
|
---|
1036 | algorithms used in the Vcodex/Vcdiff software. However, the decompression
|
---|
1037 | times should be generic and representative of any good implementation
|
---|
1038 | of the Vcdiff data format. Timing was done by running each program
|
---|
1039 | three times and taking the average of the total cpu+system times.
|
---|
1040 |
|
---|
1041 | Below are the different Vcdiff runs:
|
---|
1042 |
|
---|
1043 | Vcdiff: vcdiff is used as compressor only.
|
---|
1044 |
|
---|
1045 | Vcdiff-d: vcdiff is used as a differencer only. That is, it only
|
---|
1046 | compares target data against source data. Since the files
|
---|
1047 | involved are large, they are broken into windows. In this
|
---|
1048 | case, each target window starting at some file offset in
|
---|
1049 | the target file is compared against a source window with
|
---|
1050 | the same file offset (in the source file). The source
|
---|
1051 | window is also slightly larger than the target window
|
---|
1052 | to increase matching opportunities. The -d option also gives
|
---|
1053 | a hint to the string matching algorithm of Vcdiff that
|
---|
1054 | the two files are very similar with long stretches of matches.
|
---|
1055 | The algorithm takes advantage of this to minimize its
|
---|
1056 | processing of source data and save time.
|
---|
1057 |
|
---|
1058 | Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
|
---|
1059 | target data against target data as applicable. Thus, vcdiff
|
---|
1060 | both computes differences and compresses data. The windowing
|
---|
1061 | algorithm is the same as above. However, the above hint is
|
---|
1062 | recinded in this case.
|
---|
1063 |
|
---|
1064 | Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
|
---|
1065 | uses a content-based heuristic to select source data segments
|
---|
1066 | that are more likely to match with a given target window.
|
---|
1067 | Thus, the source data segment selected for a target window
|
---|
1068 | often will not be aligned with the file offsets of this
|
---|
1069 | target window.
|
---|
1070 |
|
---|
1071 |
|
---|
1072 | gcc-2.95.1 gcc-2.95.2 compression decompression
|
---|
1073 | raw size 55746560 55797760
|
---|
1074 | compress - 19939390 13.85s 7.09s
|
---|
1075 | gzip - 12973443 42.99s 5.35s
|
---|
1076 | Vcdiff - 15358786 20.04s 4.65s
|
---|
1077 | Vcdiff-d - 100971 10.93s 1.92s
|
---|
1078 | Vcdiff-dc - 97246 20.03s 1.84s
|
---|
1079 | Vcdiff-dcs - 256445 44.81s 1.84s
|
---|
1080 |
|
---|
1081 | TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
|
---|
1082 |
|
---|
1083 |
|
---|
1084 | TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
|
---|
1085 | sizes of the compressed results. As a pure compressor, the compression
|
---|
1086 | rate for Vcdiff is worse than gzip and better than compress. The last
|
---|
1087 | three rows shows that when two file versions are very similar, differencing
|
---|
1088 | can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
|
---|
1089 | the same simple window selection method but Vcdiff-dc also does compression
|
---|
1090 | so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
|
---|
1091 | data content to search for source data that likely will match a given target
|
---|
1092 | window. Although it does a good job, the heuristic did not always find the
|
---|
1093 | best matches which are given by the simple algorithm of Vcdiff-d. As a
|
---|
1094 | result, the output size is slightly larger. Note also that there is a large
|
---|
1095 | cost in computing matching windows this way. Finally, the compression times
|
---|
1096 | of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
|
---|
1097 | that the compression feature causes the additional time in Vcdiff-dc
|
---|
1098 | relative to Vcdiff-d. However, this is not the case. The hint given to
|
---|
1099 | the Vcdiff string matching algorithm that the two files are likely to
|
---|
1100 | have very long stretches of matches helps the algorithm to minimize
|
---|
1101 | processing of the "source data", thus saving half the time. However, as we
|
---|
1102 | shall see below when this hint is wrong, the result is even longer time.
|
---|
1103 |
|
---|
1104 |
|
---|
1105 | gcc-2.95.2 gcc-2.95.3 compression decompression
|
---|
1106 | raw size 55797760 55787520
|
---|
1107 | compress - 19939453 13.54s 7.00s
|
---|
1108 | gzip - 12998097 42.63s 5.62s
|
---|
1109 | Vcdiff - 15371737 20.09s 4.74s
|
---|
1110 | Vcdiff-d - 26383849 71.41s 6.41s
|
---|
1111 | Vcdiff-dc - 14461203 42.48s 4.82s
|
---|
1112 | Vcdiff-dcs - 1248543 61.18s 1.99s
|
---|
1113 |
|
---|
1114 | TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
|
---|
1115 |
|
---|
1116 |
|
---|
1117 | TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
|
---|
1118 | the sizes of the compressed results. In this case, the tar file of
|
---|
1119 | gcc-2.95.3 is rearranged in a way that makes the straightforward method
|
---|
1120 | of matching file offsets for source and target windows fail. As a
|
---|
1121 | result, Vcdiff-d performs rather dismally both in time and output size.
|
---|
1122 | The large time for Vcdiff-d is directly due to fact that the string
|
---|
1123 | matching algorithm has to work much harder to find matches when the hint
|
---|
1124 | that two files have long matching stretches fails to hold. On the other
|
---|
1125 | hand, Vcdiff-dc does both differencing and compression resulting in good
|
---|
1126 | output size. Finally, the window searching heuristic used in Vcdiff-dcs is
|
---|
1127 | effective in finding the right matching source windows for target windows
|
---|
1128 | resulting a small output size. This shows why the data format needs to
|
---|
1129 | have a way to specify matching windows to gain performance. Finally,
|
---|
1130 | we note that the decoding times are always good regardless of how
|
---|
1131 | the string matching or window searching algorithms perform.
|
---|
1132 |
|
---|
1133 |
|
---|
1134 | 9. FURTHER ISSUES
|
---|
1135 |
|
---|
1136 | This document does not address a few issues:
|
---|
1137 |
|
---|
1138 | Secondary compressors:
|
---|
1139 | As discussed in Section 4.3, certain sections in the delta encoding
|
---|
1140 | of a window may be further compressed by a secondary compressor.
|
---|
1141 | In our experience, the basic Vcdiff format is adequate for most
|
---|
1142 | purposes so that secondary compressors are seldom needed. In
|
---|
1143 | particular, for normal use of data differencing where the files to
|
---|
1144 | be compared have long stretches of matches, much of the gain in
|
---|
1145 | compression rate is already achieved by normal string matching.
|
---|
1146 | Thus, the use of secondary compressors is seldom needed in this case.
|
---|
1147 | However, for applications beyond differencing of such nearly identical
|
---|
1148 | files, secondary compressors may be needed to achieve maximal
|
---|
1149 | compressed results.
|
---|
1150 |
|
---|
1151 | Therefore, we recommend to leave the Vcdiff data format defined
|
---|
1152 | as in this document so that the use of secondary compressors
|
---|
1153 | can be implemented when they become needed in the future.
|
---|
1154 | The formats of the compressed data via such compressors or any
|
---|
1155 | compressors that may be defined in the future are left open to
|
---|
1156 | their implementations. These could include Huffman encoding,
|
---|
1157 | arithmetic encoding, and splay tree encoding [8,9].
|
---|
1158 |
|
---|
1159 | Large file system vs. small file system:
|
---|
1160 | As discussed in Section 4, a target window in a large file may be
|
---|
1161 | compared against some source window in another file or in the same
|
---|
1162 | file (from some earlier part). In that case, the file offset of the
|
---|
1163 | source window is specified as a variable-sized integer in the delta
|
---|
1164 | encoding. There is a possibility that the encoding was computed on
|
---|
1165 | a system supporting much larger files than in a system where
|
---|
1166 | the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
|
---|
1167 | systems). In that case, some target data may not be recoverable.
|
---|
1168 | This problem could afflict any compression format, and ought
|
---|
1169 | to be resolved with a generic negotiation mechanism in the
|
---|
1170 | appropriate protocol(s).
|
---|
1171 |
|
---|
1172 |
|
---|
1173 | 10. SUMMARY
|
---|
1174 |
|
---|
1175 | We have described Vcdiff, a general and portable encoding format for
|
---|
1176 | compression and differencing. The format is good in that it allows
|
---|
1177 | implementing a decoder without knowledge of the encoders. Further,
|
---|
1178 | ignoring the use of secondary compressors not defined within the format,
|
---|
1179 | the decoding algorithms runs in linear time and requires working space
|
---|
1180 | proportional to window sizes.
|
---|
1181 |
|
---|
1182 |
|
---|
1183 |
|
---|
1184 | 11. ACKNOWLEDGEMENTS
|
---|
1185 |
|
---|
1186 | Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
|
---|
1187 | who provided much encouragement to publicize Vcdiff. In particular, Jeff
|
---|
1188 | helped clarifying the description of the data format presented here.
|
---|
1189 |
|
---|
1190 |
|
---|
1191 |
|
---|
1192 | 12. SECURITY CONSIDERATIONS
|
---|
1193 |
|
---|
1194 | Vcdiff only provides a format to encode compressed and differenced data.
|
---|
1195 | It does not address any issues concerning how such data are, in fact,
|
---|
1196 | stored in a given file system or the run-time memory of a computer system.
|
---|
1197 | Therefore, we do not anticipate any security issues with respect to Vcdiff.
|
---|
1198 |
|
---|
1199 |
|
---|
1200 |
|
---|
1201 | 13. SOURCE CODE AVAILABILITY
|
---|
1202 |
|
---|
1203 | Vcdiff is implemented as a data transforming method in Phong Vo's
|
---|
1204 | Vcodex library. AT&T Corp. has made the source code for Vcodex available
|
---|
1205 | for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
|
---|
1206 | The source code and according license is accessible at the below URL:
|
---|
1207 |
|
---|
1208 | http://www.research.att.com/sw/tools
|
---|
1209 |
|
---|
1210 |
|
---|
1211 | 14. INTELLECTUAL PROPERTY RIGHTS
|
---|
1212 |
|
---|
1213 | The IETF has been notified of intellectual property rights claimed in
|
---|
1214 | regard to some or all of the specification contained in this
|
---|
1215 | document. For more information consult the online list of claimed
|
---|
1216 | rights, at <http://www.ietf.org/ipr.html>.
|
---|
1217 |
|
---|
1218 | The IETF takes no position regarding the validity or scope of any
|
---|
1219 | intellectual property or other rights that might be claimed to
|
---|
1220 | pertain to the implementation or use of the technology described in
|
---|
1221 | this document or the extent to which any license under such rights
|
---|
1222 | might or might not be available; neither does it represent that it
|
---|
1223 | has made any effort to identify any such rights. Information on the
|
---|
1224 | IETF's procedures with respect to rights in standards-track and
|
---|
1225 | standards-related documentation can be found in BCP-11. Copies of
|
---|
1226 | claims of rights made available for publication and any assurances of
|
---|
1227 | licenses to be made available, or the result of an attempt made to
|
---|
1228 | obtain a general license or permission for the use of such
|
---|
1229 | proprietary rights by implementors or users of this specification can
|
---|
1230 | be obtained from the IETF Secretariat.
|
---|
1231 |
|
---|
1232 |
|
---|
1233 |
|
---|
1234 | 15. IANA CONSIDERATIONS
|
---|
1235 |
|
---|
1236 | The Internet Assigned Numbers Authority (IANA) administers the number
|
---|
1237 | space for Secondary Compressor ID values. Values and their meaning
|
---|
1238 | must be documented in an RFC or other peer-reviewed, permanent, and
|
---|
1239 | readily available reference, in sufficient detail so that
|
---|
1240 | interoperability between independent implementations is possible.
|
---|
1241 | Subject to these constraints, name assignments are First Come, First
|
---|
1242 | Served - see RFC2434 [13]. Legal ID values are in the range 1..255.
|
---|
1243 |
|
---|
1244 | This document does not define any values in this number space.
|
---|
1245 |
|
---|
1246 |
|
---|
1247 | 16. REFERENCES
|
---|
1248 |
|
---|
1249 | [1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
|
---|
1250 | Practical Reusable Unix Software, Editor B. Krishnamurthy,
|
---|
1251 | John Wiley & Sons, Inc., 1995.
|
---|
1252 |
|
---|
1253 | [2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
|
---|
1254 | Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
|
---|
1255 |
|
---|
1256 | [3] W. Tichy, The String-to-String Correction Problem with Block Moves,
|
---|
1257 | ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
|
---|
1258 |
|
---|
1259 | [4] E.M. McCreight, A Space-Economical Suffix Tree Construction
|
---|
1260 | Algorithm, Journal of the ACM, 23:262-272, 1976.
|
---|
1261 |
|
---|
1262 | [5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
|
---|
1263 | IEEE Software Configuration and Maintenance Workshop, 1996.
|
---|
1264 |
|
---|
1265 | [6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
|
---|
1266 | ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
|
---|
1267 |
|
---|
1268 | [7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
|
---|
1269 | Proc. of the Summer '91 Usenix Conference, 1991.
|
---|
1270 |
|
---|
1271 | [8] D. W. Jones, Application of Splay Trees to Data Compression,
|
---|
1272 | CACM, 31(8):996:1007.
|
---|
1273 |
|
---|
1274 | [9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
|
---|
1275 | M&T Books, New York, NY, 1995.
|
---|
1276 |
|
---|
1277 | [10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
|
---|
1278 | Potential benefits of delta encoding and data compression for HTTP,
|
---|
1279 | SIGCOMM '97, Cannes, France, 1997.
|
---|
1280 |
|
---|
1281 | [11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
|
---|
1282 | Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
|
---|
1283 | IETF, draft-mogul-http-delta-10, 2001.
|
---|
1284 |
|
---|
1285 | [12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
|
---|
1286 | RFC 2119, March 1997.
|
---|
1287 |
|
---|
1288 | [13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
|
---|
1289 | Considerations Section in RFCs, RFC2434, October 1998.
|
---|
1290 |
|
---|
1291 |
|
---|
1292 |
|
---|
1293 | 17. AUTHOR'S ADDRESS
|
---|
1294 |
|
---|
1295 | Kiem-Phong Vo (main contact)
|
---|
1296 | AT&T Labs, Room D223
|
---|
1297 | 180 Park Avenue
|
---|
1298 | Florham Park, NJ 07932
|
---|
1299 | Email: kpv@research.att.com
|
---|
1300 | Phone: 1 973 360 8630
|
---|
1301 |
|
---|
1302 | David G. Korn
|
---|
1303 | AT&T Labs, Room D237
|
---|
1304 | 180 Park Avenue
|
---|
1305 | Florham Park, NJ 07932
|
---|
1306 | Email: dgk@research.att.com
|
---|
1307 | Phone: 1 973 360 8602
|
---|
1308 |
|
---|
1309 | Jeffrey C. Mogul
|
---|
1310 | Western Research Laboratory
|
---|
1311 | Compaq Computer Corporation
|
---|
1312 | 250 University Avenue
|
---|
1313 | Palo Alto, California, 94305, U.S.A.
|
---|
1314 | Email: JeffMogul@acm.org
|
---|
1315 | Phone: 1 650 617 3304 (email preferred)
|
---|
1316 |
|
---|
1317 | Joshua P. MacDonald
|
---|
1318 | Computer Science Division
|
---|
1319 | University of California, Berkeley
|
---|
1320 | 345 Soda Hall
|
---|
1321 | Berkeley, CA 94720
|
---|
1322 | Email: jmacd@cs.berkeley.edu
|
---|