mirror of
http://galexander.org/git/simplesshd.git
synced 2025-01-01 10:40:56 +00:00
311 lines
14 KiB
TeX
311 lines
14 KiB
TeX
|
\documentclass[a4paper]{article}
|
||
|
\begin{document}
|
||
|
|
||
|
|
||
|
\title{The rsync algorithm}
|
||
|
|
||
|
\author{Andrew Tridgell \quad\quad Paul Mackerras\\
|
||
|
Department of Computer Science \\
|
||
|
Australian National University \\
|
||
|
Canberra, ACT 0200, Australia}
|
||
|
|
||
|
\maketitle
|
||
|
|
||
|
\begin{abstract}
|
||
|
This report presents an algorithm for updating a file on one machine
|
||
|
to be identical to a file on another machine. We assume that the
|
||
|
two machines are connected by a low-bandwidth high-latency
|
||
|
bi-directional communications link. The algorithm identifies parts
|
||
|
of the source file which are identical to some part of the
|
||
|
destination file, and only sends those parts which cannot be matched
|
||
|
in this way. Effectively, the algorithm computes a set of
|
||
|
differences without having both files on the same machine. The
|
||
|
algorithm works best when the files are similar, but will also
|
||
|
function correctly and reasonably efficiently when the files are
|
||
|
quite different.
|
||
|
\end{abstract}
|
||
|
|
||
|
\section{The problem}
|
||
|
|
||
|
Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
|
||
|
the same as $A$. The obvious method is to copy $A$ onto $B$.
|
||
|
|
||
|
Now imagine that the two files are on machines connected by a slow
|
||
|
communications link, for example a dialup IP link. If $A$ is large,
|
||
|
copying $A$ onto $B$ will be slow. To make it faster you could
|
||
|
compress $A$ before sending it, but that will usually only gain a
|
||
|
factor of 2 to 4.
|
||
|
|
||
|
Now assume that $A$ and $B$ are quite similar, perhaps both derived
|
||
|
from the same original file. To really speed things up you would need
|
||
|
to take advantage of this similarity. A common method is to send just
|
||
|
the differences between $A$ and $B$ down the link and then use this
|
||
|
list of differences to reconstruct the file.
|
||
|
|
||
|
The problem is that the normal methods for creating a set of
|
||
|
differences between two files rely on being able to read both files.
|
||
|
Thus they require that both files are available beforehand at one end
|
||
|
of the link. If they are not both available on the same machine,
|
||
|
these algorithms cannot be used (once you had copied the file over,
|
||
|
you wouldn't need the differences). This is the problem that rsync
|
||
|
addresses.
|
||
|
|
||
|
The rsync algorithm efficiently computes which parts of a source file
|
||
|
match some part of an existing destination file. These parts need not
|
||
|
be sent across the link; all that is needed is a reference to the part
|
||
|
of the destination file. Only parts of the source file which are not
|
||
|
matched in this way need to be sent verbatim. The receiver can then
|
||
|
construct a copy of the source file using the references to parts of
|
||
|
the existing destination file and the verbatim material.
|
||
|
|
||
|
Trivially, the data sent to the receiver can be compressed using any
|
||
|
of a range of common compression algorithms, for further speed
|
||
|
improvements.
|
||
|
|
||
|
\section{The rsync algorithm}
|
||
|
|
||
|
Suppose we have two general purpose computers $\alpha$ and $\beta$.
|
||
|
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
|
||
|
file $B$, where $A$ and $B$ are ``similar''. There is a slow
|
||
|
communications link between $\alpha$ and $\beta$.
|
||
|
|
||
|
The rsync algorithm consists of the following steps:
|
||
|
|
||
|
\begin{enumerate}
|
||
|
\item $\beta$ splits the file $B$ into a series of non-overlapping
|
||
|
fixed-sized blocks of size S bytes\footnote{We have found that
|
||
|
values of S between 500 and 1000 are quite good for most purposes}.
|
||
|
The last block may be shorter than $S$ bytes.
|
||
|
|
||
|
\item For each of these blocks $\beta$ calculates two checksums:
|
||
|
a weak ``rolling'' 32-bit checksum (described below) and a strong
|
||
|
128-bit MD4 checksum.
|
||
|
|
||
|
\item $\beta$ sends these checksums to $\alpha$.
|
||
|
|
||
|
\item $\alpha$ searches through $A$ to find all blocks of length $S$
|
||
|
bytes (at any offset, not just multiples of $S$) that have the same
|
||
|
weak and strong checksum as one of the blocks of $B$. This can be
|
||
|
done in a single pass very quickly using a special property of the
|
||
|
rolling checksum described below.
|
||
|
|
||
|
\item $\alpha$ sends $\beta$ a sequence of instructions for
|
||
|
constructing a copy of $A$. Each instruction is either a reference
|
||
|
to a block of $B$, or literal data. Literal data is sent only for
|
||
|
those sections of $A$ which did not match any of the blocks of $B$.
|
||
|
\end{enumerate}
|
||
|
|
||
|
The end result is that $\beta$ gets a copy of $A$, but only the pieces
|
||
|
of $A$ that are not found in $B$ (plus a small amount of data for
|
||
|
checksums and block indexes) are sent over the link. The algorithm
|
||
|
also only requires one round trip, which minimises the impact of the
|
||
|
link latency.
|
||
|
|
||
|
The most important details of the algorithm are the rolling checksum
|
||
|
and the associated multi-alternate search mechanism which allows the
|
||
|
all-offsets checksum search to proceed very quickly. These will be
|
||
|
discussed in greater detail below.
|
||
|
|
||
|
\section{Rolling checksum}
|
||
|
|
||
|
The weak rolling checksum used in the rsync algorithm needs to have
|
||
|
the property that it is very cheap to calculate the checksum of a
|
||
|
buffer $X_2 .. X_{n+1}$ given the checksum of buffer $X_1 .. X_n$ and
|
||
|
the values of the bytes $X_1$ and $X_{n+1}$.
|
||
|
|
||
|
The weak checksum algorithm we used in our implementation was inspired
|
||
|
by Mark Adler's adler-32 checksum. Our checksum is defined by
|
||
|
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
|
||
|
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
|
||
|
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$
|
||
|
|
||
|
where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
|
||
|
For simplicity and speed, we use $M = 2^{16}$.
|
||
|
|
||
|
The important property of this checksum is that successive values can
|
||
|
be computed very efficiently using the recurrence relations
|
||
|
|
||
|
$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
|
||
|
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$
|
||
|
|
||
|
Thus the checksum can be calculated for blocks of length S at all
|
||
|
possible offsets within a file in a ``rolling'' fashion, with very
|
||
|
little computation at each point.
|
||
|
|
||
|
Despite its simplicity, this checksum was found to be quite adequate as
|
||
|
a first-level check for a match of two file blocks. We have found in
|
||
|
practice that the probability of this checksum matching when the
|
||
|
blocks are not equal is quite low. This is important because the much
|
||
|
more expensive strong checksum must be calculated for each block where
|
||
|
the weak checksum matches.
|
||
|
|
||
|
\section{Checksum searching}
|
||
|
|
||
|
Once $\alpha$ has received the list of checksums of the blocks of $B$,
|
||
|
it must search $A$ for any blocks at any offset that match the
|
||
|
checksum of some block of $B$. The basic strategy is to compute the
|
||
|
32-bit rolling checksum for a block of length $S$ starting at each
|
||
|
byte of $A$ in turn, and for each checksum, search the list for a
|
||
|
match. To do this our implementation uses a
|
||
|
simple 3 level searching scheme.
|
||
|
|
||
|
The first level uses a 16-bit hash of the 32-bit rolling checksum and
|
||
|
a $2^{16}$ entry hash table. The list of checksum values (i.e., the
|
||
|
checksums from the blocks of $B$) is sorted according to the 16-bit
|
||
|
hash of the 32-bit rolling checksum. Each entry in the hash table
|
||
|
points to the first element of the list for that hash value, or
|
||
|
contains a null value if no element of the list has that hash value.
|
||
|
|
||
|
At each offset in the file the 32-bit rolling checksum and its 16-bit
|
||
|
hash are calculated. If the hash table entry for that hash value is
|
||
|
not a null value, the second-level check is invoked.
|
||
|
|
||
|
The second-level check involves scanning the sorted checksum list
|
||
|
starting with the entry pointed to by the hash table entry, looking
|
||
|
for an entry whose 32-bit rolling checksum matches the current value.
|
||
|
The scan terminates when it reaches an entry whose 16-bit hash
|
||
|
differs. If this search finds a match, the third-level check is
|
||
|
invoked.
|
||
|
|
||
|
The third-level check involves calculating the strong checksum for the
|
||
|
current offset in the file and comparing it with the strong checksum
|
||
|
value in the current list entry. If the two strong checksums match,
|
||
|
we assume that we have found a block of $A$ which matches a block of
|
||
|
$B$. In fact the blocks could be different, but the probability of
|
||
|
this is microscopic, and in practice this is a reasonable assumption.
|
||
|
|
||
|
When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
|
||
|
the current file offset and the end of the previous match, followed by
|
||
|
the index of the block in $B$ that matched. This data is sent
|
||
|
immediately a match is found, which allows us to overlap the
|
||
|
communication with further computation.
|
||
|
|
||
|
If no match is found at a given offset in the file, the rolling
|
||
|
checksum is updated to the next offset and the search proceeds. If a
|
||
|
match is found, the search is restarted at the end of the matched
|
||
|
block. This strategy saves a considerable amount of computation for
|
||
|
the common case where the two files are nearly identical. In
|
||
|
addition, it would be a simple matter to encode the block indexes as
|
||
|
runs, for the common case where a portion of $A$ matches a series of
|
||
|
blocks of $B$ in order.
|
||
|
|
||
|
\section{Pipelining}
|
||
|
|
||
|
The above sections describe the process for constructing a copy of one
|
||
|
file on a remote system. If we have a several files to copy, we can
|
||
|
gain a considerable latency advantage by pipelining the process.
|
||
|
|
||
|
This involves $\beta$ initiating two independent processes. One of the
|
||
|
processes generates and sends the checksums to $\alpha$ while the
|
||
|
other receives the difference information from $\alpha$ and
|
||
|
reconstructs the files.
|
||
|
|
||
|
If the communications link is buffered then these two processes can
|
||
|
proceed independently and the link should be kept fully utilised in
|
||
|
both directions for most of the time.
|
||
|
|
||
|
\section{Results}
|
||
|
|
||
|
To test the algorithm, tar files were created of the Linux kernel
|
||
|
sources for two versions of the kernel. The two kernel versions were
|
||
|
1.99.10 and 2.0.0. These tar files are approximately 24MB in size and
|
||
|
are separated by 5 released patch levels.
|
||
|
|
||
|
Out of the 2441 files in the 1.99.10 release, 291 files had changed in
|
||
|
the 2.0.0 release, 19 files had been removed and 25 files had been
|
||
|
added.
|
||
|
|
||
|
A ``diff'' of the two tar files using the standard GNU diff utility
|
||
|
produced over 32 thousand lines of output totalling 2.1 MB.
|
||
|
|
||
|
The following table shows the results for rsync between the two files
|
||
|
with a varying block size.\footnote{All the tests in this section were
|
||
|
carried out using rsync version 0.5}
|
||
|
|
||
|
\vspace*{5mm}
|
||
|
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
|
||
|
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
|
||
|
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
|
||
|
|
||
|
300 & 64247 & 3817434 & 948 & 5312200 & 5629158 & 1632284 \\ \hline
|
||
|
500 & 46989 & 620013 & 64 & 1091900 & 1283906 & 979384 \\ \hline
|
||
|
700 & 33255 & 571970 & 22 & 1307800 & 1444346 & 699564 \\ \hline
|
||
|
900 & 25686 & 525058 & 24 & 1469500 & 1575438 & 544124 \\ \hline
|
||
|
1100 & 20848 & 496844 & 21 & 1654500 & 1740838 & 445204 \\ \hline
|
||
|
\end{tabular}
|
||
|
\vspace*{5mm}
|
||
|
|
||
|
In each case, the CPU time taken was less than the
|
||
|
time it takes to run ``diff'' on the two files.\footnote{The wall
|
||
|
clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
|
||
|
running SunOS, using rsh over loopback for communication. GNU diff
|
||
|
took about 4 minutes.}
|
||
|
|
||
|
The columns in the table are as follows:
|
||
|
|
||
|
\begin{description}
|
||
|
\item [block size] The size in bytes of the checksummed blocks.
|
||
|
\item [matches] The number of times a block of $B$ was found in $A$.
|
||
|
\item [tag hits] The number of times the 16-bit hash of the rolling
|
||
|
checksum matched a hash of one of the checksums from $B$.
|
||
|
\item [false alarms] The number of times the 32-bit rolling checksum
|
||
|
matched but the strong checksum didn't.
|
||
|
\item [data] The amount of file data transferred verbatim, in bytes.
|
||
|
\item [written] The total number of bytes written by $\alpha$,
|
||
|
including protocol overheads. This is almost all file data.
|
||
|
\item [read] The total number of bytes read by $\alpha$, including
|
||
|
protocol overheads. This is almost all checksum information.
|
||
|
\end{description}
|
||
|
|
||
|
The results demonstrate that for block sizes above 300 bytes, only a
|
||
|
small fraction (around 5\%) of the file was transferred. The amount
|
||
|
transferred was also considerably less than the size of the diff file
|
||
|
that would have been transferred if the diff/patch method of updating
|
||
|
a remote file was used.
|
||
|
|
||
|
The checksums themselves took up a considerable amount of space,
|
||
|
although much less than the size of the data transferred in each
|
||
|
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
|
||
|
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
|
||
|
|
||
|
The number of false alarms was less than $1/1000$ of the number of
|
||
|
true matches, indicating that the 32-bit rolling checksum is quite
|
||
|
good at screening out false matches.
|
||
|
|
||
|
The number of tag hits indicates that the second level of the
|
||
|
checksum search algorithm was invoked about once every 50
|
||
|
characters. This is quite high because the total number of blocks in
|
||
|
the file is a large fraction of the size of the tag hash table. For
|
||
|
smaller files we would expect the tag hit rate to be much closer to
|
||
|
the number of matches. For extremely large files, we should probably
|
||
|
increase the size of the hash table.
|
||
|
|
||
|
The next table shows similar results for a much smaller set of files.
|
||
|
In this case the files were not packed into a tar file first. Rather,
|
||
|
rsync was invoked with an option to recursively descend the directory
|
||
|
tree. The files used were from two source releases of another software
|
||
|
package called Samba. The total source code size is 1.7 MB and the
|
||
|
diff between the two releases is 4155 lines long totalling 120 kB.
|
||
|
|
||
|
\vspace*{5mm}
|
||
|
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
|
||
|
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
|
||
|
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
|
||
|
|
||
|
300 & 3727 & 3899 & 0 & 129775 & 153999 & 83948 \\ \hline
|
||
|
500 & 2158 & 2325 & 0 & 171574 & 189330 & 50908 \\ \hline
|
||
|
700 & 1517 & 1649 & 0 & 195024 & 210144 & 36828 \\ \hline
|
||
|
900 & 1156 & 1281 & 0 & 222847 & 236471 & 29048 \\ \hline
|
||
|
1100 & 921 & 1049 & 0 & 250073 & 262725 & 23988 \\ \hline
|
||
|
\end{tabular}
|
||
|
\vspace*{5mm}
|
||
|
|
||
|
|
||
|
\section{Availability}
|
||
|
|
||
|
An implementation of rsync which provides a convenient interface
|
||
|
similar to the common UNIX command rcp has been written and is
|
||
|
available for download from http://rsync.samba.org/
|
||
|
|
||
|
\end{document}
|