Context Navigation

← Previous Revision
Latest Revision
Next Revision →
Normal
Revision Log
Repository URL

draft-korn-vcdiff.txt@ 447

Last change on this file since 447 was 185, checked in by geyser, 18 years ago

File size: 59.3 KB

Rev	Line
[185]	1	David G. Korn, AT&T Labs
	2	Joshua P. MacDonald, UC Berkeley
	3	Jeffrey C. Mogul, Compaq WRL
	4	Internet-Draft Kiem-Phong Vo, AT&T Labs
	5	Expires: 09 November 2002 09 November 2001
	6
	7
	8	The VCDIFF Generic Differencing and Compression Data Format
	9
	10	draft-korn-vcdiff-06.txt
	11
	12
	13
	14	Status of this Memo
	15
	16	This document is an Internet-Draft and is in full conformance
	17	with all provisions of Section 10 of RFC2026.
	18
	19	Internet-Drafts are working documents of the Internet Engineering
	20	Task Force (IETF), its areas, and its working groups. Note that
	21	other groups may also distribute working documents as
	22	Internet-Drafts.
	23
	24	Internet-Drafts are draft documents valid for a maximum of six
	25	months and may be updated, replaced, or obsoleted by other
	26	documents at any time. It is inappropriate to use Internet-
	27	Drafts as reference material or to cite them other than as
	28	"work in progress."
	29
	30	The list of current Internet-Drafts can be accessed at
	31	http://www.ietf.org/ietf/1id-abstracts.txt
	32
	33	The list of Internet-Draft Shadow Directories can be accessed at
	34	http://www.ietf.org/shadow.html.
	35
	36
	37	Abstract
	38
	39	This memo describes a general, efficient and portable data format
	40	suitable for encoding compressed and/or differencing data so that
	41	they can be easily transported among computers.
	42
	43
	44	Table of Contents:
	45
	46	1. EXECUTIVE SUMMARY ............................................ 2
	47	2. CONVENTIONS .................................................. 3
	48	3. DELTA INSTRUCTIONS ........................................... 4
	49	4. DELTA FILE ORGANIZATION ...................................... 5
	50	5. DELTA INSTRUCTION ENCODING ................................... 9
	51	6. DECODING A TARGET WINDOW ..................................... 14
	52	7. APPLICATION-DEFINED CODE TABLES .............................. 16
	53	8. PERFORMANCE .................................................. 16
	54	9. FURTHER ISSUES ............................................... 17
	55	10. SUMMARY ...................................................... 18
	56	11. ACKNOWLEDGEMENTS ............................................. 18
	57	12. SECURITY CONSIDERATIONS ...................................... 18
	58	13. SOURCE CODE AVAILABILITY ..................................... 18
	59	14. INTELLECTUAL PROPERTY RIGHTS ................................. 18
	60	15. IANA CONSIDERATIONS .......................................... 19
	61	16. REFERENCES ................................................... 19
	62	17. AUTHOR'S ADDRESS ............................................. 20
	63
	64
	65	1. EXECUTIVE SUMMARY
	66
	67	Compression and differencing techniques can greatly improve storage
	68	and transmission of files and file versions. Since files are often
	69	transported across machines with distinct architectures and performance
	70	characteristics, such data should be encoded in a form that is portable
	71	and can be decoded with little or no knowledge of the encoders.
	72	This document describes Vcdiff, a compact portable encoding format
	73	designed for these purposes.
	74
	75	Data differencing is the process of computing a compact and invertible
	76	encoding of a "target file" given a "source file". Data compression
	77	is similar but without the use of source data. The UNIX utilities diff,
	78	compress, and gzip are well-known examples of data differencing and
	79	compression tools. For data differencing, the computed encoding is
	80	called a "delta file", and, for data compression, it is called
	81	a "compressed file". Delta and compressed files are good for storage
	82	and transmission as they are often smaller than the originals.
	83
	84	Data differencing and data compression are traditionally treated
	85	as distinct types of data processing. However, as shown in the Vdelta
	86	technique by Korn and Vo [1], compression can be thought of as a special
	87	case of differencing in which the source data is empty. The basic idea
	88	is to unify the string parsing scheme used in the Lempel-Ziv'77 style
	89	compressors [2], and the block-move technique of Tichy [3]. Loosely
	90	speaking, this works as follows:
	91
	92	a. Concatenate source and target data.
	93	b. Parse the data from left to right as in LZ'77 but
	94	make sure that a parsed segment starts the target data.
	95	c. Start to output when reaching target data.
	96
	97	Parsing is based on string matching algorithms such as suffix trees [4]
	98	or hashing with different time and space performance characteristics.
	99	Vdelta uses a fast string matching algorithm that requires less memory
	100	than other techniques [5,6]. However, even with this algorithm, the
	101	memory requirement can still be prohibitive for large files. A common
	102	way to deal with memory limitation is to partition an input file into
	103	chunks called "windows" and process them separately. Here, except for
	104	unpublished work by Vo, little has been done on designing effective
	105	windowing schemes. Current techniques, including Vdelta, simply use
	106	source and target windows with corresponding addresses across source
	107	and target files.
	108
	109	String matching and windowing algorithms have large influence on the
	110	compression rate of delta and compressed files. However, it is desirable
	111	to have a portable encoding format that is independent of such algorithms.
	112	This enables construction of client-server applications in which a server
	113	may serve clients with unknown computing characteristics. Unfortunately,
	114	all current differencing and compressing tools, including Vdelta, fall
	115	short in this respect. Their storage formats are closely intertwined
	116	with the implemented string matching and/or windowing algorithms.
	117
	118	The encoding format Vcdiff proposed here addresses the above issues.
	119	Vcdiff achieves the below characteristics:
	120
	121	Output compactness:
	122	The basic encoding format compactly represents compressed or delta
	123	files. Applications can further extend the basic encoding format
	124	with "secondary encoders" to achieve more compression.
	125
	126	Data portability:
	127	The basic encoding format is free from machine byte order and
	128	word size issues. This allows data to be encoded on one machine
	129	and decoded on a different machine with different architecture.
	130
	131	Algorithm genericity:
	132	The decoding algorithm is independent from string matching and
	133	windowing algorithms. This allows competition among implementations
	134	of the encoder while keeping the same decoder.
	135
	136	Decoding efficiency:
	137	Except for secondary encoder issues, the decoding algorithm runs
	138	in time proportional to the size of the target file and uses space
	139	proportional to the maximal window size. Vcdiff differs from more
	140	conventional compressors in that it uses only byte-aligned
	141	data, thus avoiding bit-level operations, which improves
	142	decoding speed at the slight cost of compression efficiency.
	143
	144	The Vcdiff data format and the algorithms for decoding data shall be
	145	described next. Since Vcdiff treats compression as a special case of
	146	differencing, we shall use the term "delta file" to indicate the
	147	compressed output for both cases.
	148
	149
	150	2. CONVENTIONS
	151
	152	The basic data unit is a byte. For portability, Vcdiff shall limit
	153	a byte to its lower eight bits even on machines with larger bytes.
	154	The bits in a byte are ordered from right to left so that the least
	155	significant bit (LSB) has value 1, and the most significant bit (MSB),
	156	has value 128.
	157
	158	For purposes of exposition in this document, we adopt the convention
	159	that the LSB is numbered 0, and the MSB is numbered 7. Bit numbers
	160	never appear in the encoded format itself.
	161
	162	Vcdiff encodes unsigned integer values using a portable variable-sized
	163	format (originally introduced in the Sfio library [7]). This encoding
	164	treats an integer as a number in base 128. Then, each digit in this
	165	representation is encoded in the lower seven bits of a byte. Except for
	166	the least significant byte, other bytes have their most significant bit
	167	turned on to indicate that there are still more digits in the encoding.
	168	The two key properties of this integer encoding that are beneficial
	169	to a data compression format are:
	170
	171	a. The encoding is portable among systems using 8-bit bytes, and
	172	b. Small values are encoded compactly.
	173
	174	For example, consider the value 123456789 which can be represented with
	175	four 7-bit digits whose values are 58, 111, 26, 21 in order from most
	176	to least significant. Below is the 8-bit byte encoding of these digits.
	177	Note that the MSBs of 58, 111 and 26 are on.
	178
	179	+-------------------------------------------+
	180	\| 10111010 \| 11101111 \| 10011010 \| 00010101 \|
	181	+-------------------------------------------+
	182	MSB+58 MSB+111 MSB+26 0+21
	183
	184
	185	Henceforth, the terms "byte" and "integer" will refer to a byte and an
	186	unsigned integer as described.
	187
	188
	189	From time to time, algorithms are exhibited to clarify the descriptions
	190	of parts of the Vcdiff format. On such occasions, the C language will be
	191	used to make precise the algorithms. The C code shown in this
	192	document is meant for clarification only, and is not part of the
	193	actual specification of the Vcdiff format.
	194
	195	In this specification, the key words "MUST", "MUST NOT",
	196	"SHOULD", "SHOULD NOT", and "MAY" document are to be interpreted as
	197	described in RFC2119 [12].
	198
	199
	200	3. DELTA INSTRUCTIONS
	201
	202	A large target file is partitioned into non-overlapping sections
	203	called "target windows". These target windows are processed separately
	204	and sequentially based on their order in the target file.
	205
	206	A target window T of length t may be compared against some source data
	207	segment S of length s. By construction, this source data segment S
	208	comes either from the source file, if one is used, or from a part of
	209	the target file earlier than T. In this way, during decoding, S is
	210	completely known when T is being decoded.
	211
	212	The choices of T, t, S and s are made by some window selection algorithm
	213	which can greatly affect the size of the encoding. However, as seen later,
	214	these choices are encoded so that no knowledge of the window selection
	215	algorithm is needed during decoding.
	216
	217	Assume that S[j] represents the jth byte in S, and T[k] represents
	218	the kth byte in T. Then, for the delta instructions, we treat the data
	219	windows S and T as substrings of a superstring U formed by concatenating
	220	them like this:
	221
	222	S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
	223
	224	The "address" of a byte in S or T is referred to by its location in U.
	225	For example, the address of T[k] is s+k.
	226
	227	The instructions to encode and direct the reconstruction of a target
	228	window are called delta instructions. There are three types:
	229
	230	ADD: This instruction has two arguments, a size x and a sequence of
	231	x bytes to be copied.
	232	COPY: This instruction has two arguments, a size x and an address p
	233	in the string U. The arguments specify the substring of U that
	234	must be copied. We shall assert that such a substring must be
	235	entirely contained in either S or T.
	236	RUN: This instruction has two arguments, a size x and a byte b that
	237	will be repeated x times.
	238
	239	Below are example source and target windows and the delta instructions
	240	that encode the target window in terms of the source window.
	241
	242	a b c d e f g h i j k l m n o p
	243	a b c d w x y z e f g h e f g h e f g h e f g h z z z z
	244
	245	COPY 4, 0
	246	ADD 4, w x y z
	247	COPY 4, 4
	248	COPY 12, 24
	249	RUN 4, z
	250
	251
	252	Thus, the first letter 'a' in the target window is at location 16
	253	in the superstring. Note that the fourth instruction, "COPY 12, 24",
	254	copies data from T itself since address 24 is position 8 in T.
	255	This instruction also shows that it is fine to overlap the data to be
	256	copied with the data being copied from as long as the latter starts
	257	earlier. This enables efficient encoding of periodic sequences,
	258	i.e., sequences with regularly repeated subsequences. The RUN instruction
	259	is a compact way to encode a sequence repeating the same byte even though
	260	such a sequence can be thought of as a periodic sequence with period 1.
	261
	262	To reconstruct the target window, one simply processes one delta
	263	instruction at a time and copy the data either from the source window
	264	or the being reconstructed target window based on the type of the
	265	instruction and the associated address, if any.
	266
	267
	268	4. DELTA FILE ORGANIZATION
	269
	270	A Vcdiff delta file starts with a Header section followed by a sequence
	271	of Window sections. The Header section includes magic bytes to identify
	272	the file type, and information concerning data processing beyond the
	273	basic encoding format. The Window sections encode the target windows.
	274
	275	Below is the overall organization of a delta file. The indented items
	276	refine the ones immediately above them. An item in square brackets may
	277	or may not be present in the file depending on the information encoded
	278	in the Indicator byte above it.
	279
	280	Header
	281	Header1 - byte
	282	Header2 - byte
	283	Header3 - byte
	284	Header4 - byte
	285	Hdr_Indicator - byte
	286	[Secondary compressor ID] - byte
	287
	288	[@@@ Why is compressor ID not an integer? ]
	289	[@@@ If we aren't defining any secondary compressors yet, then it seems
	290	that defining the [Secondary compressor ID] and the corresponding
	291	VCD_DECOMPRESS Hdr_Indicator bit in this draft has no real value. An
	292	implementation of this specification won't be able to decode a VCDIFF
	293	encoded with this option if it doesn't know about any secondary
	294	compressors. It seems that you should specify the bits related to
	295	secondary compressors once you have defined the first a secondary
	296	compressor. I can imagine a secondary-compressor might want to supply
	297	extra information, such as a dictionary of some kind, in which case
	298	this speculative treatment wouldn't go far enough.]
	299
	300	[Length of code table data] - integer
	301	[Code table data]
	302	Size of near cache - byte
	303	Size of same cache - byte
	304	Compressed code table data
	305	Window1
	306	Win_Indicator - byte
	307	[Source segment size] - integer
	308	[Source segment position] - integer
	309	The delta encoding of the target window
	310	Length of the delta encoding - integer
	311	The delta encoding
	312	Size of the target window - integer
	313	Delta_Indicator - byte
	314	Length of data for ADDs and RUNs - integer
	315	Length of instructions and sizes - integer
	316	Length of addresses for COPYs - integer
	317	Data section for ADDs and RUNs - array of bytes
	318	Instructions and sizes section - array of bytes
	319	Addresses section for COPYs - array of bytes
	320	Window2
	321	...
	322
	323
	324
	325	4.1 The Header Section
	326
	327	Each delta file starts with a header section organized as below.
	328	Note the convention that square-brackets enclose optional items.
	329
	330	Header1 - byte = 0xE6
	331	Header2 - byte = 0xD3
	332	Header3 - byte = 0xD4
	333
	334	HMMM
	335
	336	0xD6
	337	0xC3
	338	0xC4
	339
	340	Header4 - byte
	341	Hdr_Indicator - byte
	342	[Secondary compressor ID] - byte
	343	[Length of code table data] - integer
	344	[Code table data]
	345
	346	The first three Header bytes are the ASCII characters 'V', 'C' and 'D'
	347	with their most significant bits turned on (in hexadecimal, the values
	348	are 0xE6, 0xD3, and 0xD4). The fourth Header byte is currently set to
	349	zero. In the future, it might be used to indicate the version of Vcdiff.
	350
	351	The Hdr_Indicator byte shows if there are any initialization data
	352	required to aid in the reconstruction of data in the Window sections.
	353	This byte MAY have non-zero values for either, both, or neither of
	354	the two bits VCD_DECOMPRESS and VCD_CODETABLE below:
	355
	356	7 6 5 4 3 2 1 0
	357	+-+-+-+-+-+-+-+-+
	358	\| \| \| \| \| \| \| \| \|
	359	+-+-+-+-+-+-+-+-+
	360	^ ^
	361	\| \|
	362	\| +-- VCD_DECOMPRESS
	363	+---- VCD_CODETABLE
	364
	365	If bit 0 (VCD_DECOMPRESS) is non-zero, this indicates that a secondary
	366	compressor may have been used to further compress certain parts of the
	367	delta encoding data as described in Sections 4.3 and 6. In that case,
	368	the ID of the secondary compressor is given next. If this bit is zero,
	369	the compressor ID byte is not included.
	370
	371	[@@@ If we aren't defining any secondary compressors yet, then it seems
	372	this bit has no real value yet..]
	373
	374	If bit 1 (VCD_CODETABLE) is non-zero, this indicates that an
	375	application-defined code table is to be used for decoding the delta
	376	instructions. This table itself is compressed. The length of the data
	377	comprising this compressed code table and the data follow next. Section 7
	378	discusses application-defined code tables. If this bit is zero, the code
	379	table data length and the code table data are not included.
	380
	381	If both bits are set, then the compressor ID byte is included
	382	before the code table data length and the code table data.
	383
	384
	385	4.2 The Format of a Window Section
	386
	387	Each Window section is organized as follows:
	388
	389	Win_Indicator - byte
	390	[Source segment length] - integer
	391	[Source segment position] - integer
	392	The delta encoding of the target window
	393
	394
	395	Below are the detail of the various items:
	396
	397	[@@@ Here, I want to replace the Win_Indicator with a source-count,
	398	followed by source-count length/position pairs?]
	399
	400	Win_Indicator:
	401	This byte is a set of bits, as shown:
	402
	403	7 6 5 4 3 2 1 0
	404	+-+-+-+-+-+-+-+-+
	405	\| \| \| \| \| \| \| \| \|
	406	+-+-+-+-+-+-+-+-+
	407	^ ^
	408	\| \|
	409	\| +-- VCD_SOURCE
	410	+---- VCD_TARGET
	411
	412
	413	If bit 0 (VCD_SOURCE) is non-zero, this indicates that a segment
	414	of data from the "source" file was used as the corresponding
	415	source window of data to encode the target window. The decoder
	416	will use this same source data segment to decode the target window.
	417
	418	If bit 1 (VCD_TARGET) is non-zero, this indicates that a segment
	419	of data from the "target" file was used as the corresponding
	420	source window of data to encode the target window. As above, this
	421	same source data segment is used to decode the target window.
	422
	423	The Win_Indicator byte MUST NOT have more than one of the bits
	424	set (non-zero). It MAY have none of these bits set.
	425
	426	If one of these bits is set, the byte is followed by two
	427	integers to indicate respectively the length and position of
	428	the source data segment in the relevant file. If the
	429	indicator byte is zero, the target window was compressed
	430	by itself without comparing against another data segment,
	431	and these two integers are not included.
	432
	433	The delta encoding of the target window:
	434	This contains the delta encoding of the target window either
	435	in terms of the source data segment (i.e., VCD_SOURCE
	436	or VCD_TARGET was set) or by itself if no source window
	437	is specified. This data format is discussed next.
	438
	439
	440	4.3 The Delta Encoding of a Target Window
	441
	442	The delta encoding of a target window is organized as follows:
	443
	444	Length of the delta encoding - integer
	445	The delta encoding
	446	Length of the target window - integer
	447	Delta_Indicator - byte
	448	Length of data for ADDs and RUNs - integer
	449	Length of instructions section - integer
	450	Length of addresses for COPYs - integer
	451	Data section for ADDs and RUNs - array of bytes
	452	Instructions and sizes section - array of bytes
	453	Addresses section for COPYs - array of bytes
	454
	455
	456	Length of the delta encoding:
	457	This integer gives the total number of remaining bytes that
	458	comprise data of the delta encoding for this target window.
	459
	460	The delta encoding:
	461	This contains the data representing the delta encoding which
	462	is described next.
	463
	464	Length of the target window:
	465	This integer indicates the actual size of the target window
	466	after decompression. A decoder can use this value to allocate
	467	memory to store the uncompressed data.
	468
	469	Delta_Indicator:
	470	This byte is a set of bits, as shown:
	471
	472	7 6 5 4 3 2 1 0
	473	+-+-+-+-+-+-+-+-+
	474	\| \| \| \| \| \| \| \| \|
	475	+-+-+-+-+-+-+-+-+
	476	^ ^ ^
	477	\| \| \|
	478	\| \| +-- VCD_DATACOMP
	479	\| +---- VCD_INSTCOMP
	480	+------ VCD_ADDRCOMP
	481
	482	VCD_DATACOMP: bit value 1.
	483	VCD_INSTCOMP: bit value 2.
	484	VCD_ADDRCOMP: bit value 4.
	485
	486	As discussed, the delta encoding consists of COPY, ADD and RUN
	487	instructions. The ADD and RUN instructions have accompanying
	488	unmatched data (that is, data that does not specifically match
	489	any data in the source window or in some earlier part of the
	490	target window) and the COPY instructions have addresses of where
	491	the matches occur. OPTIONALLY, these types of data MAY be further
	492	compressed using a secondary compressor. Thus, Vcdiff separates
	493	the encoding of the delta instructions into three parts:
	494
	495	a. The unmatched data in the ADD and RUN instructions,
	496	b. The delta instructions and accompanying sizes, and
	497	c. The addresses of the COPY instructions.
	498
	499	If the bit VCD_DECOMPRESS (Section 4.1) was on, each of these
	500	sections may have been compressed using the specified secondary
	501	compressor. The bit positions 0 (VCD_DATACOMP), 1 (VCD_INSTCOMP),
	502	and 2 (VCD_ADDRCOMP) respectively indicate, if non-zero, that
	503	the corresponding parts are compressed. Then, these parts MUST
	504	be decompressed before decoding the delta instructions.
	505
	506	Length of data for ADDs and RUNs:
	507	This is the length (in bytes) of the section of data storing
	508	the unmatched data accompanying the ADD and RUN instructions.
	509
	510	Length of instructions section:
	511	This is the length (in bytes) of the delta instructions and
	512	accompanying sizes.
	513
	514	Length of addresses for COPYs:
	515	This is the length (in bytes) of the section storing
	516	the addresses of the COPY instructions.
	517
	518	Data section for ADDs and RUNs:
	519	This sequence of bytes encodes the unmatched data for the ADD
	520	and RUN instructions.
	521
	522	Instructions and sizes section:
	523	This sequence of bytes encodes the instructions and their sizes.
	524
	525	Addresses section for COPYs:
	526	This sequence of bytes encodes the addresses of the COPY
	527	instructions.
	528
	529
	530	5. DELTA INSTRUCTION ENCODING
	531
	532	The delta instructions described in Section 3 represent the results of
	533	string matching. For many data differencing applications in which the
	534	changes between source and target data are small, any straightforward
	535	representation of these instructions would be adequate. However, for
	536	applications including data compression, it is important to encode
	537	these instructions well to achieve good compression rates. From our
	538	experience, the following observations can be made:
	539
	540	a. The addresses in COPY instructions are locations of matches and
	541	often occur close by or even exactly equal to one another. This is
	542	because data in local regions are often replicated with minor changes.
	543	In turn, this means that coding a newly matched address against some
	544	set of recently matched addresses can be beneficial.
	545
	546	b. The matches are often short in length and separated by small amounts
	547	of unmatched data. That is, the lengths of COPY and ADD instructions
	548	are often small. This is particularly true of binary data such as
	549	executable files or structured data such as HTML or XML. In such cases,
	550	compression can be improved by combining the encoding of the sizes
	551	and the instruction types as well as combining the encoding of adjacent
	552	delta instructions with sufficiently small data sizes.
	553
	554	The below subsections discuss how the Vcdiff data format provides
	555	mechanisms enabling encoders to use the above observations to improve
	556	compression rates.
	557
	558
	559	5.1 Address Encoding Modes of COPY Instructions
	560
	561	As mentioned earlier, addresses of COPY instructions often occur close
	562	to one another or are exactly equal. To take advantage of this phenomenon
	563	and encode addresses of COPY instructions more efficiently, the Vcdiff
	564	data format supports the use of two different types of address caches.
	565	Both the encoder and decoder maintain these caches, so that decoder's
	566	caches remain synchronized with the encoder's caches.
	567
	568	a. A "near" cache is an array with "s_near" slots, each containing an
	569	address used for encoding addresses nearby to previously encoded
	570	addresses (in the positive direction only). The near cache also
	571	maintains a "next_slot" index to the near cache. New entries to the
	572	near cache are always inserted in the next_slot index, which maintains
	573	a circular buffer of the s_near most recent addresses.
	574
	575	b. A "same" cache is an array with "s_same" multiple of 256 slots, each
	576	containing an address. The same cache maintains a hash table of recent
	577	addresses used for repeated encoding of the exact same address.
	578
	579
	580	By default, the parameters s_near and s_same are respectively set to 4
	581	and 3. An encoder MAY modify these values, but then it MUST encode the
	582	new values in the encoding itself, as discussed in Section 7, so that
	583	the decoder can properly set up its own caches.
	584
	585	At the start of processing a target window, an implementation
	586	(encoder or decoder) initializes all of the slots in both caches
	587	to zero. The next_slot pointer of the near cache is set
	588	to point to slot zero.
	589
	590	Each time a COPY instruction is processed by the encoder or
	591	decoder, the implementation's caches are updated as follows, where
	592	"addr" is the address in the COPY instruction.
	593
	594	a. The slot in the near cache referenced by the next_slot
	595	index is set to addr. The next_slot index is then incremented
	596	modulo s_near.
	597
	598	b. The slot in the same cache whose index is addr%(s_same*256)
	599	is set to addr. [We use the C notations of % for modulo and
	600	* for multiplication.]
	601
	602
	603	5.2 Example code for maintaining caches
	604
	605	To make clear the above description, below are example cache data
	606	structures and algorithms to initialize and update them:
	607
	608	typedef struct _cache_s
	609	{
	610	int* near; /* array of size s_near */
	611	int s_near;
	612	int next_slot; /* the circular index for near */
	613	int* same; /* array of size s_same256 /
	614	int s_same;
	615	} Cache_t;
	616
	617	cache_init(Cache_t* ka)
	618	{
	619	int i;
	620
	621	ka->next_slot = 0;
	622	for(i = 0; i < ka->s_near; ++i)
	623	ka->near[i] = 0;
	624
	625	for(i = 0; i < ka->s_same*256; ++i)
	626	ka->same[i] = 0;
	627	}
	628
	629	cache_update(Cache_t* ka, int addr)
	630	{
	631	if(ka->s_near > 0)
	632	{ ka->near[ka->next_slot] = addr;
	633	ka->next_slot = (ka->next_slot + 1) % ka->s_near;
	634	}
	635
	636	if(ka->s_same > 0)
	637	ka->same[addr % (ka->s_same*256)] = addr;
	638	}
	639
	640
	641	5.3 Encoding of COPY instruction addresses
	642
	643	The address of a COPY instruction is encoded using different modes
	644	depending on the type of cached address used, if any.
	645
	646	Let "addr" be the address of a COPY instruction to be decoded and "here"
	647	be the current location in the target data (i.e., the start of the data
	648	about to be encoded or decoded). Let near[j] be the jth element in
	649	the near cache, and same[k] be the kth element in the same cache.
	650	Below are the possible address modes:
	651
	652	VCD_SELF: This mode has value 0. The address was encoded by itself
	653	as an integer.
	654
	655	VCD_HERE: This mode has value 1. The address was encoded as
	656	the integer value "here - addr".
	657
	658	Near modes: The "near modes" are in the range [2,s_near+1]. Let m
	659	be the mode of the address encoding. The address was encoded
	660	as the integer value "addr - near[m-2]".
	661
	662	Same modes: The "same modes" are in the range
	663	[s_near+2,s_near+s_same+1]. Let m be the mode of the encoding.
	664	The address was encoded as a single byte b such that
	665	"addr == same[(m - (s_near+2))*256 + b]".
	666
	667
	668	5.3 Example code for encoding and decoding of COPY instruction addresses
	669
	670	We show example algorithms below to demonstrate use of address modes more
	671	clearly. The encoder has freedom to choose address modes, the sample
	672	addr_encode() algorithm merely shows one way of picking the address
	673	mode. The decoding algorithm addr_decode() will uniquely decode addresses
	674	regardless of the encoder's algorithm choice.
	675
	676	Note that the address caches are updated immediately after an address is
	677	encoded or decoded. In this way, the decoder is always synchronized with
	678	the encoder.
	679
	680	int addr_encode(Cache_t* ka, int addr, int here, int* mode)
	681	{
	682	int i, d, bestd, bestm;
	683
	684	/* Attempt to find the address mode that yields the
	685	* smallest integer value for "d", the encoded address
	686	* value, thereby minimizing the encoded size of the
	687	* address. */
	688
	689	bestd = addr; bestm = VCD_SELF; /* VCD_SELF == 0 */
	690
	691	if((d = here-addr) < bestd)
	692	{ bestd = d; bestm = VCD_HERE; } /* VCD_HERE == 1 */
	693
	694	for(i = 0; i < ka->s_near; ++i)
	695	if((d = addr - ka->near[i]) >= 0 && d < bestd)
	696	{ bestd = d; bestm = i+2; }
	697
	698	if(ka->s_same > 0 && ka->same[d = addr%(ka->s_same*256)] == addr)
	699	{ bestd = d%256; bestm = ka->s_near + 2 + d/256; }
	700
	701	cache_update(ka,addr);
	702
	703	mode = bestm; / this returns the address encoding mode */
	704	return bestd; /* this returns the encoded address */
	705	}
	706
	707	Note that the addr_encode() algorithm chooses the best address mode using a
	708	local optimization, but that may not lead to the best encoding efficiency
	709	because different modes lead to different instruction encodings, as described below.
	710
	711	The functions addrint() and addrbyte() used in addr_decode() obtain from
	712	the "Addresses section for COPYs" (Section 4.3) an integer or a byte,
	713	respectively. These utilities will not be described here. We simply
	714	recall that an integer is represented as a compact variable-sized string
	715	of bytes as described in Section 2 (i.e., base 128).
	716
	717	int addr_decode(Cache_t* ka, int here, int mode)
	718	{ int addr, m;
	719
	720	if(mode == VCD_SELF)
	721	addr = addrint();
	722	else if(mode == VCD_HERE)
	723	addr = here - addrint();
	724	else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
	725	addr = ka->near[m] + addrint();
	726	else /* same cache */
	727	{ m = mode - (2 + ka->s_near);
	728	addr = ka->same[m*256 + addrbyte()];
	729	}
	730
	731	cache_update(ka, addr);
	732
	733	return addr;
	734	}
	735
	736
	737	5.4 Instruction Codes
	738
	739	As noted, the data sizes associated with delta instructions are often
	740	small. Thus, compression efficiency can be improved by combining the sizes
	741	and instruction types in a single encoding, as well by combining certain
	742	pairs of adjacent delta instructions. Effective choices of when to perform
	743	such combinations depend on many factors including the data being processed
	744	and the string matching algorithm in use. For example, if many COPY
	745	instructions have the same data sizes, it may be worth to encode these
	746	instructions more compactly than others.
	747
	748	The Vcdiff data format is designed so that a decoder does not need to be
	749	aware of the choices made in encoding algorithms. This is achieved with the
	750	notion of an "instruction code table" containing 256 entries. Each entry
	751	defines either a single delta instruction or a pair of instructions that
	752	have been combined. Note that the code table itself only exists in main
	753	memory, not in the delta file (unless using an application-defined code
	754	table, described in Section 7). The encoded data simply includes the index
	755	of each instruction and, since there are only 256 indices, each index
	756	can be represented as a single byte.
	757
	758	Each instruction code entry contains six fields, each of which
	759	is a single byte with unsigned value:
	760
	761	+-----------------------------------------------+
	762	\| inst1 \| size1 \| mode1 \| inst2 \| size2 \| mode2 \|
	763	+-----------------------------------------------+
	764
	765	@@@ could be more compact
	766
	767	Each triple (inst,size,mode) defines a delta instruction. The meanings
	768	of these fields are as follows:
	769
	770	inst: An "inst" field can have one of the four values: NOOP (0), ADD (1),
	771	RUN (2) or COPY (3) to indicate the instruction types. NOOP means
	772	that no instruction is specified. In this case, both the corresponding
	773	size and mode fields will be zero.
	774
	775	size: A "size" field is zero or positive. A value zero means that the
	776	size associated with the instruction is encoded separately as
	777	an integer in the "Instructions and sizes section" (Section 6).
	778	A positive value for "size" defines the actual data size.
	779	Note that since the size is restricted to a byte, the maximum
	780	value for any instruction with size implicitly defined in the code
	781	table is 255.
	782
	783	mode: A "mode" field is significant only when the associated delta
	784	instruction is a COPY. It defines the mode used to encode the
	785	associated addresses. For other instructions, this is always zero.
	786
	787
	788	5.5 The Code Table
	789
	790	Following the discussions on address modes and instruction code tables,
	791	we define a "Code Table" to have the data below:
	792
	793	s_near: the size of the near cache,
	794	s_same: the size of the same cache,
	795	i_code: the 256-entry instruction code table.
	796
	797	Vcdiff itself defines a "default code table" in which s_near is 4
	798	and s_same is 3. Thus, there are 9 address modes for a COPY instruction.
	799	The first two are VCD_SELF (0) and VCD_HERE (1). Modes 2, 3, 4 and 5
	800	are for addresses coded against the near cache. And, modes 6, 7 and 8
	801	are for addresses coded against the same cache.
	802
	803	The default instruction code table is depicted below, in a compact
	804	representation that we use only for descriptive purposes. See section 7
	805	for the specification of how an instruction code table is represented
	806	in the Vcdiff encoding format. In the depiction, a zero value for
	807	size indicates that the size is separately coded. The mode of non-COPY
	808	instructions is represented as 0 even though they are not used.
	809
	810
	811	TYPE SIZE MODE TYPE SIZE MODE INDEX
	812	---------------------------------------------------------------
	813	1. RUN 0 0 NOOP 0 0 0
	814	2. ADD 0, [1,17] 0 NOOP 0 0 [1,18]
	815	3. COPY 0, [4,18] 0 NOOP 0 0 [19,34]
	816	4. COPY 0, [4,18] 1 NOOP 0 0 [35,50]
	817	5. COPY 0, [4,18] 2 NOOP 0 0 [51,66]
	818	6. COPY 0, [4,18] 3 NOOP 0 0 [67,82]
	819	7. COPY 0, [4,18] 4 NOOP 0 0 [83,98]
	820	8. COPY 0, [4,18] 5 NOOP 0 0 [99,114]
	821	9. COPY 0, [4,18] 6 NOOP 0 0 [115,130]
	822	10. COPY 0, [4,18] 7 NOOP 0 0 [131,146]
	823	11. COPY 0, [4,18] 8 NOOP 0 0 [147,162]
	824	12. ADD [1,4] 0 COPY [4,6] 0 [163,174]
	825	13. ADD [1,4] 0 COPY [4,6] 1 [175,186]
	826	14. ADD [1,4] 0 COPY [4,6] 2 [187,198]
	827	15. ADD [1,4] 0 COPY [4,6] 3 [199,210]
	828	16. ADD [1,4] 0 COPY [4,6] 4 [211,222]
	829	17. ADD [1,4] 0 COPY [4,6] 5 [223,234]
	830	18. ADD [1,4] 0 COPY 4 6 [235,238]
	831	19. ADD [1,4] 0 COPY 4 7 [239,242]
	832	20. ADD [1,4] 0 COPY 4 8 [243,246]
	833	21. COPY 4 [0,8] ADD 1 0 [247,255]
	834	---------------------------------------------------------------
	835
	836	In the above depiction, each numbered line represents one or more
	837	entries in the actual instruction code table (recall that an entry in
	838	the instruction code table may represent up to two combined delta
	839	instructions.) The last column ("INDEX") shows which index value or
	840	range of index values of the entries covered by that line. The notation
	841	[i,j] means values from i through j, inclusive. The first 6 columns of
	842	a line in the depiction describe the pairs of instructions used for
	843	the corresponding index value(s).
	844
	845	If a line in the depiction includes a column entry using the [i,j]
	846	notation, this means that the line is instantiated for each value
	847	in the range from i to j, inclusive. The notation "0, [i,j]" means
	848	that the line is instantiated for the value 0 and for each value
	849	in the range from i to j, inclusive.
	850
	851	If a line in the depiction includes more than one entry using the [i,j]
	852	notation, implying a "nested loop" to convert the line to a range of
	853	table entries, the first such [i,j] range specifies the outer loop,
	854	and the second specifies the inner loop.
	855
	856	The below examples should make clear the above description:
	857
	858	Line 1 shows the single RUN instruction with index 0. As the size field
	859	is 0, this RUN instruction always has its actual size encoded separately.
	860
	861	Line 2 shows the 18 single ADD instructions. The ADD instruction with
	862	size field 0 (i.e., the actual size is coded separately) has index 1.
	863	ADD instructions with sizes from 1 to 17 use code indices 2 to 18 and
	864	their sizes are as given (so they will not be separately encoded.)
	865
	866	Following the single ADD instructions are the single COPY instructions
	867	ordered by their address encoding modes. For example, line 11 shows the
	868	COPY instructions with mode 8, i.e., the last of the same cache.
	869	In this case, the COPY instruction with size field 0 has index 147.
	870	Again, the actual size of this instruction will be coded separately.
	871
	872	Lines 12 to 21 show the pairs of instructions that are combined together.
	873	For example, line 12 depicts the 12 entries in which an ADD instruction
	874	is combined with an immediately following COPY instruction. The entries
	875	with indices 163, 164, 165 represent the pairs in which the ADD
	876	instructions all have size 1 while the COPY instructions has mode
	877	0 (VCD_SELF) and sizes 4, 5 and 6 respectively.
	878
	879	The last line, line 21, shows the eight instruction pairs where the first
	880	instruction is a COPY and the second is an ADD. In this case, all COPY
	881	instructions have size 4 with mode ranging from 0 to 8 and all the ADD
	882	instructions have size 1. Thus, the entry with largest index 255
	883	combines a COPY instruction of size 4 and mode 8 with an ADD instruction
	884	of size 1.
	885
	886	The choice of the minimum size 4 for COPY instructions in the default code
	887	table was made from experiments that showed that excluding small matches
	888	(less then 4 bytes long) improved the compression rates.
	889
	890
	891	6. DECODING A TARGET WINDOW
	892
	893	Section 4.3 discusses that the delta instructions and associated data
	894	are encoded in three arrays of bytes:
	895
	896	Data section for ADDs and RUNs,
	897	Instructions and sizes section, and
	898	Addresses section for COPYs.
	899
	900
	901	Further, these data sections may have been further compressed by some
	902	secondary compressor. Assuming that any such compressed data has been
	903	decompressed so that we now have three arrays:
	904
	905	inst: bytes coding the instructions and sizes.
	906	data: unmatched data associated with ADDs and RUNs.
	907	addr: bytes coding the addresses of COPYs.
	908
	909	These arrays are organized as follows:
	910
	911	inst:
	912	a sequence of (index, [size1], [size2]) tuples, where "index"
	913	is an index into the instruction code table, and size1 and size2
	914	are integers that MAY or MAY NOT be included in the tuple as
	915	follows. The entry with the given "index" in the instruction
	916	code table potentially defines two delta instructions. If the
	917	first delta instruction is not a VCD_NOOP and its size is zero,
	918	then size1 MUST be present. Otherwise, size1 MUST be omitted and
	919	the size of the instruction (if it is not VCD_NOOP) is as defined
	920	in the table. The presence or absence of size2 is defined
	921	similarly with respect to the second delta instruction.
	922
	923	data:
	924	a sequence of data values, encoded as bytes.
	925
	926	addr:
	927	a sequence of address values. Addresses are normally encoded as
	928	integers as described in Section 2 (i.e., base 128).
	929	Since the same cache emits addresses in the range [0,255],
	930	however, same cache addresses are always encoded as a
	931	single byte.
	932
	933	To summarize, each tuple in the "inst" array includes an index to some
	934	entry in the instruction code table that determines:
	935
	936	a. Whether one or two instructions were encoded and their types.
	937
	938	b. If the instructions have their sizes encoded separately, these
	939	sizes will follow, in order, in the tuple.
	940
	941	c. If the instructions have accompanying data, i.e., ADDs or RUNs,
	942	their data will be in the array "data".
	943
	944	d. Similarly, if the instructions are COPYs, the coded addresses are
	945	found in the array "addr".
	946
	947	The decoding procedure simply processes the arrays by reading one code
	948	index at a time, looking up the corresponding instruction code entry,
	949	then consuming the respective sizes, data and addresses following the
	950	directions in this entry. In other words, the decoder maintains an implicit
	951	next-element pointer for each array; "consuming" an instruction tuple,
	952	data, or address value implies incrementing the associated pointer.
	953
	954	For example, if during the processing of the target window, the next
	955	unconsumed tuple in the inst array has index value 19, then the first
	956	instruction is a COPY, whose size is found as the immediately following
	957	integer in the inst array. Since the mode of this COPY instruction is
	958	VCD_SELF, the corresponding address is found by consuming the next
	959	integer in the addr array. The data array is left intact. As the second
	960	instruction for code index 19 is a NOOP, this tuple is finished.
	961
	962
	963	7. APPLICATION-DEFINED CODE TABLES
	964
	965	Although the default code table used in Vcdiff is good for general
	966	purpose encoders, there are times when other code tables may perform
	967	better. For example, to code a file with many identical segments of data,
	968	it may be advantageous to have a COPY instruction with the specific size
	969	of these data segments so that the instruction can be encoded in a single
	970	byte. Such a special code table MUST then be encoded in the delta file
	971	so that the decoder can reconstruct it before decoding the data.
	972
	973	Vcdiff allows an application-defined code table to be specified
	974	in a delta file with the following data:
	975
	976	Size of near cache - byte
	977	Size of same cache - byte
	978	Compressed code table data
	979
	980	The "compressed code table data" encodes the delta between the default
	981	code table (source) and the new code table (target) in the same manner as
	982	described in Section 4.3 for encoding a target window in terms of a
	983	source window. This delta is computed using the following steps:
	984
	985	a. Convert the new instruction code table into a string, "code", of
	986	1536 bytes using the below steps in order:
	987
	988	i. Add in order the 256 bytes representing the types of the first
	989	instructions in the instruction pairs.
	990	ii. Add in order the 256 bytes representing the types of the second
	991	instructions in the instruction pairs.
	992	iii. Add in order the 256 bytes representing the sizes of the first
	993	instructions in the instruction pairs.
	994	iv. Add in order the 256 bytes representing the sizes of the second
	995	instructions in the instruction pairs.
	996	v. Add in order the 256 bytes representing the modes of the first
	997	instructions in the instruction pairs.
	998	vi. Add in order the 256 bytes representing the modes of the second
	999	instructions in the instruction pairs.
	1000
	1001	b. Similarly, convert the default instruction code table into
	1002	a string "dflt".
	1003
	1004	c. Treat the string "code" as a target window and "dflt" as the
	1005	corresponding source data and apply an encoding algorithm to
	1006	compute the delta encoding of "code" in terms of "dflt".
	1007	This computation MUST use the default code table for encoding
	1008	the delta instructions.
	1009
	1010	The decoder can then reverse the above steps to decode the compressed
	1011	table data using the method of Section 6, employing the default code
	1012	table, to generate the new code table. Note that the decoder does not
	1013	need to know anything about the details of the encoding algorithm used
	1014	in step (c). The decoder is still able to decode the new code table
	1015	because the Vcdiff format is independent from the choice of encoding
	1016	algorithm, and because the encoder in step (c) uses the known, default
	1017	code table.
	1018
	1019
	1020	8. PERFORMANCE
	1021
	1022	The encoding format is compact. For compression only, using the LZ-77
	1023	string parsing strategy and without any secondary compressors, the typical
	1024	compression rate is better than Unix compress and close to gzip. For
	1025	differencing, the data format is better than all known methods in
	1026	terms of its stated goal, which is primarily decoding speed and
	1027	encoding efficiency.
	1028
	1029	We compare the performance of compress, gzip and Vcdiff using the
	1030	archives of three versions of the Gnu C compiler, gcc-2.95.1.tar,
	1031	gcc-2.95.2.tar and gcc-2.95.3.tar. The experiments were done on an
	1032	SGI-MIPS3, 400MHZ. Gzip was used at its default compression level.
	1033	Vcdiff timings were done using the Vcodex/Vcdiff software (Section 13).
	1034	As string and window matching typically dominates the computation during
	1035	compression, the Vcdiff compression times were directly due to the
	1036	algorithms used in the Vcodex/Vcdiff software. However, the decompression
	1037	times should be generic and representative of any good implementation
	1038	of the Vcdiff data format. Timing was done by running each program
	1039	three times and taking the average of the total cpu+system times.
	1040
	1041	Below are the different Vcdiff runs:
	1042
	1043	Vcdiff: vcdiff is used as compressor only.
	1044
	1045	Vcdiff-d: vcdiff is used as a differencer only. That is, it only
	1046	compares target data against source data. Since the files
	1047	involved are large, they are broken into windows. In this
	1048	case, each target window starting at some file offset in
	1049	the target file is compared against a source window with
	1050	the same file offset (in the source file). The source
	1051	window is also slightly larger than the target window
	1052	to increase matching opportunities. The -d option also gives
	1053	a hint to the string matching algorithm of Vcdiff that
	1054	the two files are very similar with long stretches of matches.
	1055	The algorithm takes advantage of this to minimize its
	1056	processing of source data and save time.
	1057
	1058	Vcdiff-dc: This is similar to Vcdiff-d but vcdiff can also compare
	1059	target data against target data as applicable. Thus, vcdiff
	1060	both computes differences and compresses data. The windowing
	1061	algorithm is the same as above. However, the above hint is
	1062	recinded in this case.
	1063
	1064	Vcdiff-dcs: This is similar to Vcdiff-dc but the windowing algorithm
	1065	uses a content-based heuristic to select source data segments
	1066	that are more likely to match with a given target window.
	1067	Thus, the source data segment selected for a target window
	1068	often will not be aligned with the file offsets of this
	1069	target window.
	1070
	1071
	1072	gcc-2.95.1 gcc-2.95.2 compression decompression
	1073	raw size 55746560 55797760
	1074	compress - 19939390 13.85s 7.09s
	1075	gzip - 12973443 42.99s 5.35s
	1076	Vcdiff - 15358786 20.04s 4.65s
	1077	Vcdiff-d - 100971 10.93s 1.92s
	1078	Vcdiff-dc - 97246 20.03s 1.84s
	1079	Vcdiff-dcs - 256445 44.81s 1.84s
	1080
	1081	TABLE 1. Compressing gcc-2.95.2.tar given gcc-2.95.1
	1082
	1083
	1084	TABLE 1 shows the raw sizes of gcc-2.95.1.tar and gcc-2.95.2.tar and the
	1085	sizes of the compressed results. As a pure compressor, the compression
	1086	rate for Vcdiff is worse than gzip and better than compress. The last
	1087	three rows shows that when two file versions are very similar, differencing
	1088	can have dramatically good compression rates. Vcdiff-d and Vcdiff-dc use
	1089	the same simple window selection method but Vcdiff-dc also does compression
	1090	so its output is slightly smaller. Vcdiff-dcs uses a heuristic based on
	1091	data content to search for source data that likely will match a given target
	1092	window. Although it does a good job, the heuristic did not always find the
	1093	best matches which are given by the simple algorithm of Vcdiff-d. As a
	1094	result, the output size is slightly larger. Note also that there is a large
	1095	cost in computing matching windows this way. Finally, the compression times
	1096	of Vcdiff-d is nearly half of that of Vcdiff-dc. It is tempting to conclude
	1097	that the compression feature causes the additional time in Vcdiff-dc
	1098	relative to Vcdiff-d. However, this is not the case. The hint given to
	1099	the Vcdiff string matching algorithm that the two files are likely to
	1100	have very long stretches of matches helps the algorithm to minimize
	1101	processing of the "source data", thus saving half the time. However, as we
	1102	shall see below when this hint is wrong, the result is even longer time.
	1103
	1104
	1105	gcc-2.95.2 gcc-2.95.3 compression decompression
	1106	raw size 55797760 55787520
	1107	compress - 19939453 13.54s 7.00s
	1108	gzip - 12998097 42.63s 5.62s
	1109	Vcdiff - 15371737 20.09s 4.74s
	1110	Vcdiff-d - 26383849 71.41s 6.41s
	1111	Vcdiff-dc - 14461203 42.48s 4.82s
	1112	Vcdiff-dcs - 1248543 61.18s 1.99s
	1113
	1114	TABLE 2. Compressing gcc-2.95.3.tar given gcc-2.95.2
	1115
	1116
	1117	TABLE 2 shows the raw sizes of gcc-2.95.2.tar and gcc-2.95.3.tar and
	1118	the sizes of the compressed results. In this case, the tar file of
	1119	gcc-2.95.3 is rearranged in a way that makes the straightforward method
	1120	of matching file offsets for source and target windows fail. As a
	1121	result, Vcdiff-d performs rather dismally both in time and output size.
	1122	The large time for Vcdiff-d is directly due to fact that the string
	1123	matching algorithm has to work much harder to find matches when the hint
	1124	that two files have long matching stretches fails to hold. On the other
	1125	hand, Vcdiff-dc does both differencing and compression resulting in good
	1126	output size. Finally, the window searching heuristic used in Vcdiff-dcs is
	1127	effective in finding the right matching source windows for target windows
	1128	resulting a small output size. This shows why the data format needs to
	1129	have a way to specify matching windows to gain performance. Finally,
	1130	we note that the decoding times are always good regardless of how
	1131	the string matching or window searching algorithms perform.
	1132
	1133
	1134	9. FURTHER ISSUES
	1135
	1136	This document does not address a few issues:
	1137
	1138	Secondary compressors:
	1139	As discussed in Section 4.3, certain sections in the delta encoding
	1140	of a window may be further compressed by a secondary compressor.
	1141	In our experience, the basic Vcdiff format is adequate for most
	1142	purposes so that secondary compressors are seldom needed. In
	1143	particular, for normal use of data differencing where the files to
	1144	be compared have long stretches of matches, much of the gain in
	1145	compression rate is already achieved by normal string matching.
	1146	Thus, the use of secondary compressors is seldom needed in this case.
	1147	However, for applications beyond differencing of such nearly identical
	1148	files, secondary compressors may be needed to achieve maximal
	1149	compressed results.
	1150
	1151	Therefore, we recommend to leave the Vcdiff data format defined
	1152	as in this document so that the use of secondary compressors
	1153	can be implemented when they become needed in the future.
	1154	The formats of the compressed data via such compressors or any
	1155	compressors that may be defined in the future are left open to
	1156	their implementations. These could include Huffman encoding,
	1157	arithmetic encoding, and splay tree encoding [8,9].
	1158
	1159	Large file system vs. small file system:
	1160	As discussed in Section 4, a target window in a large file may be
	1161	compared against some source window in another file or in the same
	1162	file (from some earlier part). In that case, the file offset of the
	1163	source window is specified as a variable-sized integer in the delta
	1164	encoding. There is a possibility that the encoding was computed on
	1165	a system supporting much larger files than in a system where
	1166	the data may be decoded (e.g., 64-bit file systems vs. 32-bit file
	1167	systems). In that case, some target data may not be recoverable.
	1168	This problem could afflict any compression format, and ought
	1169	to be resolved with a generic negotiation mechanism in the
	1170	appropriate protocol(s).
	1171
	1172
	1173	10. SUMMARY
	1174
	1175	We have described Vcdiff, a general and portable encoding format for
	1176	compression and differencing. The format is good in that it allows
	1177	implementing a decoder without knowledge of the encoders. Further,
	1178	ignoring the use of secondary compressors not defined within the format,
	1179	the decoding algorithms runs in linear time and requires working space
	1180	proportional to window sizes.
	1181
	1182
	1183
	1184	11. ACKNOWLEDGEMENTS
	1185
	1186	Thanks are due to Balachander Krishnamurthy, Jeff Mogul and Arthur Van Hoff
	1187	who provided much encouragement to publicize Vcdiff. In particular, Jeff
	1188	helped clarifying the description of the data format presented here.
	1189
	1190
	1191
	1192	12. SECURITY CONSIDERATIONS
	1193
	1194	Vcdiff only provides a format to encode compressed and differenced data.
	1195	It does not address any issues concerning how such data are, in fact,
	1196	stored in a given file system or the run-time memory of a computer system.
	1197	Therefore, we do not anticipate any security issues with respect to Vcdiff.
	1198
	1199
	1200
	1201	13. SOURCE CODE AVAILABILITY
	1202
	1203	Vcdiff is implemented as a data transforming method in Phong Vo's
	1204	Vcodex library. AT&T Corp. has made the source code for Vcodex available
	1205	for anyone to use to transmit data via HTTP/1.1 Delta Encoding [10,11].
	1206	The source code and according license is accessible at the below URL:
	1207
	1208	http://www.research.att.com/sw/tools
	1209
	1210
	1211	14. INTELLECTUAL PROPERTY RIGHTS
	1212
	1213	The IETF has been notified of intellectual property rights claimed in
	1214	regard to some or all of the specification contained in this
	1215	document. For more information consult the online list of claimed
	1216	rights, at <http://www.ietf.org/ipr.html>.
	1217
	1218	The IETF takes no position regarding the validity or scope of any
	1219	intellectual property or other rights that might be claimed to
	1220	pertain to the implementation or use of the technology described in
	1221	this document or the extent to which any license under such rights
	1222	might or might not be available; neither does it represent that it
	1223	has made any effort to identify any such rights. Information on the
	1224	IETF's procedures with respect to rights in standards-track and
	1225	standards-related documentation can be found in BCP-11. Copies of
	1226	claims of rights made available for publication and any assurances of
	1227	licenses to be made available, or the result of an attempt made to
	1228	obtain a general license or permission for the use of such
	1229	proprietary rights by implementors or users of this specification can
	1230	be obtained from the IETF Secretariat.
	1231
	1232
	1233
	1234	15. IANA CONSIDERATIONS
	1235
	1236	The Internet Assigned Numbers Authority (IANA) administers the number
	1237	space for Secondary Compressor ID values. Values and their meaning
	1238	must be documented in an RFC or other peer-reviewed, permanent, and
	1239	readily available reference, in sufficient detail so that
	1240	interoperability between independent implementations is possible.
	1241	Subject to these constraints, name assignments are First Come, First
	1242	Served - see RFC2434 [13]. Legal ID values are in the range 1..255.
	1243
	1244	This document does not define any values in this number space.
	1245
	1246
	1247	16. REFERENCES
	1248
	1249	[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
	1250	Practical Reusable Unix Software, Editor B. Krishnamurthy,
	1251	John Wiley & Sons, Inc., 1995.
	1252
	1253	[2] J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data
	1254	Compression, IEEE Trans. on Information Theory, 23(3):337-343, 1977.
	1255
	1256	[3] W. Tichy, The String-to-String Correction Problem with Block Moves,
	1257	ACM Transactions on Computer Systems, 2(4):309-321, November 1984.
	1258
	1259	[4] E.M. McCreight, A Space-Economical Suffix Tree Construction
	1260	Algorithm, Journal of the ACM, 23:262-272, 1976.
	1261
	1262	[5] J.J. Hunt, K.P. Vo, W. Tichy, An Empirical Study of Delta Algorithms,
	1263	IEEE Software Configuration and Maintenance Workshop, 1996.
	1264
	1265	[6] J.J. Hunt, K.P. Vo, W. Tichy, Delta Algorithms: An Empirical Analysis,
	1266	ACM Trans. on Software Engineering and Methodology, 7:192-214, 1998.
	1267
	1268	[7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library,
	1269	Proc. of the Summer '91 Usenix Conference, 1991.
	1270
	1271	[8] D. W. Jones, Application of Splay Trees to Data Compression,
	1272	CACM, 31(8):996:1007.
	1273
	1274	[9] M. Nelson, J. Gailly, The Data Compression Book, ISBN 1-55851-434-1,
	1275	M&T Books, New York, NY, 1995.
	1276
	1277	[10] J.C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy,
	1278	Potential benefits of delta encoding and data compression for HTTP,
	1279	SIGCOMM '97, Cannes, France, 1997.
	1280
	1281	[11] J.C. Mogul, B. Krishnamurthy, F. Douglis, A. Feldmann,
	1282	Y. Goland, and A. Van Hoff, Delta Encoding in HTTP,
	1283	IETF, draft-mogul-http-delta-10, 2001.
	1284
	1285	[12] S. Bradner, Key words for use in RFCs to Indicate Requirement Levels,
	1286	RFC 2119, March 1997.
	1287
	1288	[13] T. Narten, H. Alvestrand, Guidelines for Writing an IANA
	1289	Considerations Section in RFCs, RFC2434, October 1998.
	1290
	1291
	1292
	1293	17. AUTHOR'S ADDRESS
	1294
	1295	Kiem-Phong Vo (main contact)
	1296	AT&T Labs, Room D223
	1297	180 Park Avenue
	1298	Florham Park, NJ 07932
	1299	Email: kpv@research.att.com
	1300	Phone: 1 973 360 8630
	1301
	1302	David G. Korn
	1303	AT&T Labs, Room D237
	1304	180 Park Avenue
	1305	Florham Park, NJ 07932
	1306	Email: dgk@research.att.com
	1307	Phone: 1 973 360 8602
	1308
	1309	Jeffrey C. Mogul
	1310	Western Research Laboratory
	1311	Compaq Computer Corporation
	1312	250 University Avenue
	1313	Palo Alto, California, 94305, U.S.A.
	1314	Email: JeffMogul@acm.org
	1315	Phone: 1 650 617 3304 (email preferred)
	1316
	1317	Joshua P. MacDonald
	1318	Computer Science Division
	1319	University of California, Berkeley
	1320	345 Soda Hall
	1321	Berkeley, CA 94720
	1322	Email: jmacd@cs.berkeley.edu

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: nikanabo/current/xdelta/diy/draft-korn-vcdiff.txt@ 447

Download in other formats: