How to use openvcdiff
open-vcdiff is an encoder and decoder that can write and read the VCDIFF format described in RFC 3284: The VCDIFF Generic Differencing and Compression Data Format. The encoder uses the Bentley/McIlroy technique for finding matches between the source and target data.
An encoder/decoder named Xdelta, which also reads and writes the VCDIFF format, has already been released by Josh MacDonald under the GNU General Public License v2. open-vcdiff is released under the Apache 2.0 license, which makes it suitable for use in applications that will not necessarily be released as open-source.
The primary purpose of open-vcdiff is to be included in implementations of the Shared-Dictionary Compression over HTTP (SDCH, or "Sandwich") protocol. Please join the SDCH Google Group if you want to find out more about SDCH.
Note: In this document, and in the source code, the term "dictionary" is used interchangeably with the term "source" (or "source file" or "source data") as defined in RFC 3284.
open-vcdiff conforms to the VCDIFF draft standard described in RFC 3284. The following non-standard extensions to the format can be enabled if desired:
- Interleaved format. The VCDIFF draft standard format divides each encoded delta window into three sections (data, instructions, and addresses), with the aim of improving compressibility of the encoded file using a secondary compressor such as gzip. The drawback to this approach is that none of the target data can be reconstructed unless the entire delta window is available. The delta window is received in packets over the network and it is desirable to be able to process its contents as they arrive. In order to facilitate decoding a stream of packets from the network, we have modified the VCDIFF format so that it interleaves the data, instructions, and addresses instead of placing them in three separate sections. Each instruction is followed by its size and then by an address or literal data. This feature is enabled by passing the format flag VCD_FORMAT_INTERLEAVED to the C++ encoder interface, or by specifying -interleaved on the vcdiff command line.
- Adler32 checksum. The format can be modified to include an Adler32 checksum of the target window data. If the checksum format is used, then bit 2 (0x04, defined as VCD_CHECKSUM) of the Win_Indicator byte will be set, and the checksum will appear just after the "Length of addresses for COPYs" field and before the "Data section for ADDs and RUNs" section in the encoding. The decoder will verify the decoded target data against the checksum, if it is present. This feature is enabled by passing the format flag VCD_FORMAT_CHECKSUM to the C++ encoder interface, or by specifying -checksum on the vcdiff command line. This checksum format is not compatible with the Adler32 checksum used by Xdelta.
- Version header byte (Header4). If either of the two enhancements described above is used, then the resulting format will not conform to the VCDIFF draft standard as described in RFC 3284. In order to indicate this deviation from the standard, the fourth byte in the encoding (Header4, reserved for the VCDIFF version code) will be set to 0x53 (a capital "S" character in ASCII). If neither enhancement is used, the fourth byte may be 0x00 (a null character), the default value described in the standard.
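The markers that these extensions leave in an encoded file can be sketched in a few lines of C++. The code below is only an illustration and is not taken from open-vcdiff: the function names and the HeaderKind enum are hypothetical, the three magic bytes 0xD6 0xC3 0xC4 are the ones RFC 3284 defines for the start of a VCDIFF file, and Adler32() is the textbook algorithm. How the library serializes the checksum value inside a window is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Win_Indicator bit that marks a window carrying an Adler32 checksum
// (bit 2, value 0x04, called VCD_CHECKSUM in the text above).
constexpr unsigned char kVcdChecksumBit = 0x04;

// Returns true if the Win_Indicator byte has the checksum bit set.
inline bool WindowHasChecksum(unsigned char win_indicator) {
  return (win_indicator & kVcdChecksumBit) != 0;
}

// Textbook Adler-32: two running sums modulo 65521, combined as (b << 16) | a.
// Reducing on every byte is slow but keeps the sketch obviously overflow-free.
inline std::uint32_t Adler32(const char* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  constexpr std::uint32_t kModulus = 65521;
  std::uint32_t a = 1;
  std::uint32_t b = 0;
  for (std::size_t i = 0; i < size; ++i) {
    a = (a + static_cast<unsigned char>(data[i])) % kModulus;
    b = (b + a) % kModulus;
  }
  return (b << 16) | a;
}

// Hypothetical classification of the first four bytes of a delta file.
enum class HeaderKind { kNotVcdiff, kStandard, kWithExtensions, kUnknownVersion };

inline HeaderKind ClassifyHeader(const std::string& delta) {
  // RFC 3284 magic bytes: 'V', 'C', 'D' with the high bit set.
  if (delta.size() < 4 ||
      static_cast<unsigned char>(delta[0]) != 0xD6 ||
      static_cast<unsigned char>(delta[1]) != 0xC3 ||
      static_cast<unsigned char>(delta[2]) != 0xC4) {
    return HeaderKind::kNotVcdiff;
  }
  switch (static_cast<unsigned char>(delta[3])) {
    case 0x00: return HeaderKind::kStandard;        // plain RFC 3284 format
    case 0x53: return HeaderKind::kWithExtensions;  // ASCII 'S'
    default:   return HeaderKind::kUnknownVersion;
  }
}
```

A tool that accepts deltas from several encoders could call ClassifyHeader() first and refuse files whose fourth byte it does not recognize, rather than failing later inside a window.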
The package has been built and tested using the Autoconf/Automake build system on Red Hat Linux, Ubuntu Linux, Cygwin on Windows, Mac OS X, and Solaris 10, using gcc versions 3.2.2, 3.4.4, and 4.0.3. Other flavors of Unix may require some minor changes.
On Microsoft Windows, you can build the package using either Cygwin or Visual Studio 2005. See below for details on the latter.
Of course, the authors will be delighted to receive submissions of patches that will add support for additional operating systems and compilers.
Some notes about porting open-vcdiff to specific environments follow.
Solution and project files for Microsoft Visual Studio 2005 are provided in the vsprojects directory of the CVS source tree (but not the source tarball.)
In order to build open-vcdiff on OS X, you will need to download and install Xcode if it is not already installed on your machine.
On Solaris 10, if you run across a build error "libstdc++.la is not a valid libtool archive", please refer to this Sun forum post for a workaround.
See the INSTALL file for generic installation instructions for C++. Basically:
./configure
make
sudo make install
make will compile and link the open-vcdiff libraries and unit tests as well as vcdiff, a simple command-line utility to run the encoder and decoder. make install will create a google subdirectory under /usr/local/include if it does not already exist.
Typical usage of vcdiff is as follows (the < and > are file redirect operations, not optional arguments):
vcdiff encode -dictionary file.dict < target_file > delta_file
vcdiff decode -dictionary file.dict < delta_file > target_file
To see the command-line syntax of vcdiff, use vcdiff -help or just vcdiff.
To call the encoder from C++ code, assuming that dictionary, target, and delta are all std::string objects:
#include <google/vcencoder.h> // Read this file for interface details
[...]
open_vcdiff::VCDiffEncoder encoder(dictionary.data(), dictionary.size());
encoder.SetFormatFlags(open_vcdiff::VCD_FORMAT_INTERLEAVED);
encoder.Encode(target.data(), target.size(), &delta);
Calling the decoder is just as simple:
#include <google/vcdecoder.h> // Read this file for interface details
[...]
open_vcdiff::VCDiffDecoder decoder;
decoder.Decode(dictionary.data(), dictionary.size(), delta, &target);
When using the encoder, the C++ application must be linked with the library options -lvcdcom and -lvcdenc; when using the decoder, it must be linked with -lvcdcom and -lvcddec.
The preceding examples use the simple interface to the encoder and decoder, which assumes that the entire target file to be encoded, or the entire delta file to be decoded, is immediately available. There is also a streaming interface, which can be used when the target or delta data is received incrementally.
Encoding target files using the streaming encoder involves the following steps:
- Include the header file <google/vcencoder.h>.
- Load the dictionary into memory. If the dictionary is stored in a file, this can be done with the Unix system call mmap.
- Create a HashedDictionary object using the dictionary address and length.
  - A pointer to this object can be retained and used for many encoding operations. A pointer to a single const HashedDictionary object can be shared and used concurrently by multiple encoding threads.
  - Call the Init() method on the HashedDictionary object after creating it.
- Create a VCDiffStreamingEncoder object using the HashedDictionary object plus the following additional arguments.
  - A set of format extensions for the encoder to use. Standard format is represented by VCD_STANDARD_FORMAT. To use interleaved format and/or checksum format, use the format flags VCD_FORMAT_INTERLEAVED, VCD_FORMAT_CHECKSUM, or VCD_FORMAT_INTERLEAVED | VCD_FORMAT_CHECKSUM.
  - The parameter look_for_target_matches controls whether the encoder will look for target matches within the previously encoded target data. In our testing, we have found that it is best to set this parameter to false if gzip is to be applied to the delta file after VCDIFF encoding.
- For each target file to be encoded:
  - Create a string object in which to store the delta encoding. This is normally a std::string. With some specialization of the OutputString template class, the output string can be any type that supports append(), size(), etc. See the header file <google/output_string.h> for details.
  - Call the StartEncoding() method on the VCDiffStreamingEncoder object.
  - Loop through reading as much target data as possible. Each time more data arrives, call EncodeChunk(), and process any delta data that has been appended to the output string.
  - When all target data is exhausted, call FinishEncoding() and process any additional delta data that has been appended to the output string.
  - If any of these methods returns false, an error has occurred and has been logged to stderr. In that case, do not continue with the encoding operation.
- Remember to link the code with the library options -lvcdcom and -lvcdenc.
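The per-file part of the steps above (start, feed chunks as they arrive, finish, and stop on the first failure) can be sketched as a small driver function. So that it compiles without the library installed, this sketch is written as a template over any type that offers StartEncoding(), EncodeChunk(), and FinishEncoding() taking a pointer to the output string; those signatures are an assumption based on the steps above and should be checked against <google/vcencoder.h>. EncodeChunks() and PassThroughEncoder are hypothetical names, and PassThroughEncoder is a stand-in that performs no VCDIFF encoding at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Feeds the chunks to the encoder in order and passes every piece of delta
// output to `sink` as soon as it has been appended to the output string.
// Returns false as soon as any encoder call fails.
template <typename Encoder, typename Sink>
bool EncodeChunks(Encoder& encoder,
                  const std::vector<std::string>& chunks,
                  Sink sink) {
  std::string delta;  // the encoder appends its output here

  // Hands pending output to the sink, then empties the buffer.
  auto flush = [&delta, &sink]() {
    if (!delta.empty()) {
      sink(delta);
      delta.clear();
    }
  };

  if (!encoder.StartEncoding(&delta)) return false;
  flush();
  for (const std::string& chunk : chunks) {
    if (!encoder.EncodeChunk(chunk.data(), chunk.size(), &delta)) return false;
    flush();
  }
  if (!encoder.FinishEncoding(&delta)) return false;
  flush();
  return true;
}

// Hypothetical stand-in, NOT part of open-vcdiff: it copies its input through
// unchanged between two marker characters so the driver can be exercised.
struct PassThroughEncoder {
  bool StartEncoding(std::string* out) {
    assert(out != nullptr);
    out->append("<");
    return true;
  }
  bool EncodeChunk(const char* data, std::size_t len, std::string* out) {
    assert(out != nullptr);
    out->append(data, len);
    return true;
  }
  bool FinishEncoding(std::string* out) {
    assert(out != nullptr);
    out->append(">");
    return true;
  }
};
```

In a real program the chunks would come from a socket or file rather than a vector, and the encoder argument would presumably be a VCDiffStreamingEncoder built from an initialized HashedDictionary. Emptying the output string after each call keeps memory use bounded when the delta is forwarded as it is produced.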
Likewise, decoding delta files using the streaming decoder involves the following steps:
- Include the header file <google/vcdecoder.h>.
- Load the dictionary into memory. If the dictionary is stored in a file, this can be done with the Unix system call mmap.
- Create a VCDiffStreamingDecoder object.
- For each delta file to be decoded:
  - Create a string object in which to store the decoded target. This is normally a std::string, but see the encoder instructions above which describe how to use a different type.
  - Call the StartDecoding() method on the VCDiffStreamingDecoder object using the dictionary address and length.
  - Loop through reading as much delta data as possible. Each time more data arrives, call DecodeChunk(), and process any target data that has been appended to the output string.
  - When all delta data is exhausted, call FinishDecoding() and process any additional target data that has been appended to the output string.
  - If DecodeChunk() or FinishDecoding() returns false, an error has occurred and has been logged to stderr. In that case, do not continue with the decoding operation.
- Link the code with the library options -lvcdcom and -lvcddec.
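Both the encoding and decoding procedures begin by loading the dictionary into memory, optionally with mmap. A minimal POSIX-only sketch of that step is shown below. The MappedDictionary class and its interface are hypothetical and not part of open-vcdiff; the sketch simply wraps open(), fstat(), mmap(), and munmap() so that the mapping is released when the object goes out of scope. An empty or unreadable file is reported through ok() returning false.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only memory mapping of a dictionary file (POSIX only).
// Hypothetical helper for illustration, not an open-vcdiff class.
class MappedDictionary {
 public:
  explicit MappedDictionary(const char* path) {
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return;  // leaves the object in the "not ok" state
    struct stat st;
    // mmap rejects a zero length, so empty files are treated as failures.
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
      void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                     PROT_READ, MAP_PRIVATE, fd, 0);
      if (p != MAP_FAILED) {
        data_ = static_cast<const char*>(p);
        size_ = static_cast<std::size_t>(st.st_size);
      }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
  }

  ~MappedDictionary() {
    if (data_ != nullptr) munmap(const_cast<char*>(data_), size_);
  }

  // The object owns the mapping, so copying is disabled.
  MappedDictionary(const MappedDictionary&) = delete;
  MappedDictionary& operator=(const MappedDictionary&) = delete;

  bool ok() const { return data_ != nullptr; }
  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

 private:
  const char* data_ = nullptr;
  std::size_t size_ = 0;
};
```

The data() and size() values are the "dictionary address and length" that the steps above pass to the HashedDictionary constructor or to StartDecoding(). The MappedDictionary object must outlive every encoder or decoder that reads from the mapping.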
For simple examples of how to use the streaming encoder and decoder, please see the comments in the header files <google/vcencoder.h> and <google/vcdecoder.h>. For an example of a full application that uses these interfaces, please see the source file vcdiff_main.cc, included in this package, which implements the command-line client.
To verify that the package works on your system, especially after making modifications to the source code, please run the unit tests using make check. If you find that the unit tests fail on your system without having made any changes to the code, please contact opensource@google.com.
The Google C++ Style Guide has been followed as much as possible within this package, and we ask contributors to familiarize themselves with those guidelines and follow them when making modifications to the code.
The authors can be reached by e-mail at opensource@google.com.