Sysresccd-manual-en Backup and transfer your data using rsync

From SystemRescueCd

Overview

Rsync is an open source file synchronization program. It's a very advanced tool that can be used to make backups, or to copy files across the network. It can also be used to make disk-to-disk copies of files.

It runs natively on Linux and Unix, and it is also available on Windows through cygwin. You can therefore use rsync to make remote backups of data files stored on Windows, or to replicate Windows system backups created with ntbackup. On Linux, rsync can be combined with LVM snapshots to make consistent online backups.

The main advantage of rsync is that it copies large files as efficiently as large numbers of small files. It is robust (it checksums the data it transfers) and very flexible. It is far better suited than scp/ftp/http to file transfers. It is based on an algorithm that detects redundancies, which saves a lot of bandwidth. This makes rsync an excellent tool if you need an offsite backup solution.

It supports all types of files (flat files, links, hard links), extended attributes (xattr) and ACLs (advanced permissions). It can be used to make all sorts of backups: full backups, differential backups, incremental backups, as long as you work on flat files, which is often the case on Linux/Unix systems.

Rsync also has drawbacks. For instance, the options are not always obvious, so you may make mistakes if you forget important ones (such as the options that keep the advanced file attributes). You must also store your backups on an appropriate Linux filesystem if you want to keep all these attributes. With an archiving tool, the attributes would be stored inside the archive, and you could then store the archive on any older-generation filesystem.

Rsync is provided with SystemRescueCd, but this documentation is useful in other contexts too, so you don't have to use SystemRescueCd to follow the instructions given here.

Basic usage

Once compiled, rsync is installed as a single binary which can act as a server (when started with the option --daemon) or as a client, so you don't need two programs. Rsync can copy files either on a local machine or across the network. The three available modes are described below with examples. The first option we use is -a, which tells rsync to preserve all the basic characteristics of the files (for instance, a symbolic link is copied as a link). Be very careful with the trailing slash on the source path, because it changes the result: with a trailing slash rsync copies the contents of the directory, and without it rsync creates the directory itself inside the destination.

File copy on a local machine

In a disk-to-disk copy, only one rsync process is involved. This mode can be used for local backups. Here is an example:

rsync -a /home/mydata/ /backups/data-20080810/
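The effect of the trailing slash can be made explicit in a small wrapper. The sketch below is only an illustration: the function name backup_local and the data-YYYYMMDD naming scheme are assumptions modelled on the example path above, not something rsync provides.

```shell
# Hypothetical helper: copy the *contents* of a directory into a dated
# backup directory such as /backups/data-20080810/.
backup_local() {
    src=${1%/}          # drop one trailing slash, if present, so exactly one is added below
    dest_root=${2%/}
    stamp=$(date +%Y%m%d)
    # "$src/" -> the files inside $src land directly in the destination.
    # "$src"  -> rsync would create a sub-directory named after $src instead.
    rsync -a "$src/" "$dest_root/data-$stamp/"
}

# Example call (paths are illustrative):
# backup_local /home/mydata /backups
```

Normalizing the slash inside the helper means the caller gets the same result whether or not the source path was typed with a trailing slash.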

Remote backup in standalone mode

In this mode rsync runs as a daemon listening on port tcp/873 at one end, and the client runs at the other end. It works in both directions: the rsync client can pull files from the rsync daemon, or push files to it. The rsync daemon can be installed either in direct listen mode (the rsync process listens on the tcp port itself), or behind an inetd/xinetd daemon that listens on behalf of rsync. The second method is supposed to be more secure, since inetd/xinetd can do security checks. It also means that you don't have to restart the service when you change the configuration file of the rsync daemon.

Here is how the client can push a directory to a remote rsync server running a daemon:

rsync -a /home/mydata/ 192.168.1.1::mybackups/data-20080810/

The client can also download the files from the remote host:

rsync -a 192.168.1.1::mybackups/data-20080810/ /home/mydata/

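Note the two colons between the remote host address and the remote path: they tell rsync to contact a daemon. The first component of the path (mybackups here) is the name of a module defined in the daemon configuration.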

Remote backup over ssh

If the port tcp/873 is not open, you can also use rsync through ssh. All you have to do is specify an option on the client side (rsync -e ssh). The client then connects to the ssh server at the specified address and runs an rsync program remotely through the secure shell. An sshd server must be running on the server side, and rsync must be installed at both ends. The contents are encrypted because the transfer goes through ssh.

Here is how the client can push data to the remote host:

rsync -a -e ssh /home/mydata/ 192.168.1.1:/backups/data-20080810/

You can also transfer files in the other direction:

rsync -a -e ssh 192.168.1.1:/backups/data-20080810/ /home/mydata/

Note that there is only one colon between the remote host address and the remote path when rsync runs over ssh.
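The -e option takes the whole remote-shell command, so ssh options can be passed inside it. The sketch below assumes an sshd listening on a non-standard port; the function name push_over_ssh, the user name and the port number are hypothetical placeholders.

```shell
# Hypothetical helper: push a directory over ssh to a server whose sshd
# listens on a port other than 22.
push_over_ssh() {
    src=${1%/}
    remote=$2           # e.g. backup@192.168.1.1:/backups/data-20080810/
    port=${3:-22}       # default to the standard ssh port
    # The quoted string after -e is the complete remote-shell command line.
    rsync -a -e "ssh -p $port" "$src/" "$remote"
}

# Example call (user, host and port are illustrative):
# push_over_ssh /home/mydata backup@192.168.1.1:/backups/data-20080810/ 2222
```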

Transfer large data over a slow/unreliable network

We often need to copy a large amount of data, either several big files or thousands of small files. In both cases scp/ftp/http are often unreliable, and resuming a transfer is a problem because we may have to check which files have already been copied. When the process stops at 99% of a large file, resuming is also very complex, and even when there is an option for it (eg: wget -c) we may end up with a corrupt file. In both cases, we may be forced to transfer everything again just to be sure that the transfer is complete. So transfers can be a nightmare over slow or unreliable network connections.

Since rsync is efficient on large files, it is often offered as a download method, even though it's less popular than http and ftp. For instance, many Linux distributions provide mirrors where you can use rsync to download iso images of the installation disc.

Transferring thousands of small files

Rsync handles such situations very well. By default, rsync only transfers the files which need to be copied. In other words, it first compares the source file and the destination file, and copies only if they differ. This saves a lot of bandwidth in remote transfers, and it also saves a lot of time even in local backups. For regular backups, it means we only have to copy the files which have been modified since the previous backup.

By default, rsync uses the file modification-time and size to make the comparison. It considers two files with the same modification-time and size to be identical. This is a good behaviour in most cases. The comparison is very quick because rsync does not have to read the file contents, only the file attributes. If you don't trust these two attributes (modification-time and size), you can specify -c or --checksum so that rsync compares a checksum of each file instead. In a remote copy this still saves a lot of bandwidth, since only the checksums have to be transferred for identical files.

With the default behaviour (comparison based on the file modification-time and size), you can interrupt the transfer and run rsync again. The files which have already been copied have the same time and size, so rsync skips them and simply resumes the transfer. You can be confident that the files which rsync skips are not corrupt, because rsync checksums the data it transfers. The only thing you may lose with the default options is the file which was being transferred when the process was stopped. For small files this does not matter, but you may be interested in resuming the transfer of large files with rsync.

Transferring large files

Rsync is able to resume the transfer of large files when you use two options together: --partial and --inplace. The former means "keep partially transferred files" and the latter means "update destination files in-place". Both are necessary because by default rsync removes partially transferred files when it is interrupted. If you lose the connection when the file is 99% done, you have to transfer everything again. And when --inplace is not used, rsync writes to a temporary file with a random name during the transfer, and renames it once it's done. The problem is that when you restart rsync, it can't find the partial destination file and so it transfers the whole file again. When --inplace is specified, rsync writes directly to the destination file name, so the transfer can be resumed. You should check that the option -u / --update is not used, otherwise the incomplete destination file may be considered newer than the original file, and the transfer won't be resumed.

Rsync uses a very efficient algorithm to compare the contents of two files (the source and destination files). It's able to detect redundancies even in large files. That way, it only transfers the parts of the files which differ, so it saves a lot of bandwidth on files with redundancies. So if you have to transfer several versions of the same file on a regular basis, you should just copy the old version (that you have already transferred) to the new destination file name, and rsync will automatically skip all the common parts.

So you can transfer large files very efficiently when using rsync with --partial and --inplace. You can interrupt and resume the transfer as many times as you want and you won't lose what has already been done. And since rsync checksums the data, you can be confident that the files won't be corrupt.

Here is an example of good tuning to copy a directory with large files remotely:

rsync -a --partial --inplace /home/bigfiles/ 192.168.1.1::mybackups/bigfiles/
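On an unreliable link, the resume behaviour can be automated by simply running the same command again until it succeeds. The following is a minimal sketch under stated assumptions: the function name rsync_retry, the attempt limit and the delay are hypothetical choices, not rsync features.

```shell
# Hypothetical helper: re-run the same rsync command until it succeeds or a
# maximum number of attempts is reached.
# Usage: rsync_retry MAX_ATTEMPTS DELAY_SECONDS SOURCE DESTINATION
rsync_retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # --partial --inplace let each run continue from the data already received.
        if rsync -a --partial --inplace "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example call (values are illustrative): up to 10 attempts, 30 seconds apart.
# rsync_retry 10 30 /home/bigfiles/ 192.168.1.1::mybackups/bigfiles/
```

The loop retries on any non-zero status, so a permanent error (for example a wrong module name) will also be retried until the limit is reached. Keep the limit small.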

You should also check that you are running up-to-date versions of rsync. The rsync algorithm and the protocol which help to save bandwidth have been improved over time. When the rsync versions involved in a remote copy are different (client and server), rsync just uses the best common protocol available. So you should try to have a recent rsync installed at both ends to optimize your transfers.

The most important rsync options

Rsync comes with a lot of options. The purpose of this tutorial is not to be exhaustive, so only the most useful rsync options are described here. Whenever both exist, the short name and the long name of an option are given; you only need to use one of them.

-a, --archive: preserve default attributes

This option is very important in most cases. It ensures that rsync preserves the basic file attributes (permissions, times, type of file, ...), so that the destination file is similar to the original file. For instance, a symbolic link is copied as a link. Without this option, a symbolic link would not be recreated as a link at the destination. Be careful: hard links, extended attributes (xattr) and ACLs are not preserved by this option, so you should also add -HAX if you need them.

-u, --update: skip files that are newer on the receiver

Use this option only if files may have been modified in the destination directory. For instance, if you are migrating data from an old server to a new one and people have already started working on the new server, you may use this option to be sure that rsync does not overwrite their changes with an older version. For normal backups, this option is useless.

-c, --checksum: skip based on checksum, not mod-time & size

By default rsync just compares the modification-time and the size of a file to know whether it has been modified. This is safe and quick in most cases. Nevertheless, if you think that two files with the same size and modification-time may differ, or if you just want to be absolutely sure that the contents are identical, you can use this option.

-x, --one-file-system: don't cross filesystem boundaries

If you use rsync to make backups of a live machine, you may want to back up only one filesystem at a time. By default, rsync copies every file it finds, so if you back up your root filesystem with rsync, all the mounted filesystems are backed up at the same time. To prevent this default behaviour, just use this option.

-z, --compress: compress file data during the transfer

Rsync can compress the data it transfers to save bandwidth. The data are only compressed during the transfer: the destination file will be the same as the original one. Use it for remote transfers when you are copying uncompressed files with a good compression ratio (eg: large text files, or raw images). This option does not help with files which are already compressed: zip, gz, bz2, jpeg, ...

--inplace and --partial

Use these options if you want to transfer large files and be sure that the transfer will resume at the same point after a connection failure. See the section Transferring large files for more details.

--progress: show progress during transfer

This option simply displays the progress of the transfer, so it's useful when you run rsync by hand.

--delete: delete extraneous files from destination dirs

By default, rsync leaves in place the files which are in the destination directory but not in the source directory. This can be a problem for backups: when data files are removed from the source directory, they remain in the backup directory. Use this option if you want rsync to remove the files which exist only in the destination directory.
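Because --delete removes files from the destination, it can be worth previewing a run before applying it. The sketch below shows one way to do that with the standard -n (--dry-run) and -i (--itemize-changes) options; the function names and the example paths are hypothetical.

```shell
# Hypothetical helper: show what a mirroring run with --delete would change,
# without touching the destination.
mirror_preview() {
    # -n (--dry-run): make no changes; -i (--itemize-changes): list each planned action.
    rsync -a --delete -n -i "$1" "$2"
}

# Hypothetical helper: the real run, to be used once the preview looks right.
mirror_apply() {
    rsync -a --delete "$1" "$2"
}

# Example calls (paths are illustrative):
# mirror_preview /home/mydata/ /backups/current/
# mirror_apply   /home/mydata/ /backups/current/
```

A mistake in the source path combined with --delete can empty a backup, which is why the preview and the real run are kept as two separate steps here.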

--remove-source-files: sender removes synchronized files (non-dir)

You may use this option if you want rsync to move your data to the destination. By default, rsync makes a copy. With this option, the source files are deleted by rsync once they have been transferred successfully.

--exclude=pattern: exclude files matching pattern

Use this option when you want to exclude files or directories from the transfer. For instance, you may want to exclude the temporary files or other useless data when you make a backup.
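The --exclude option can be repeated, once per pattern. The following sketch is a hedged example: the function name and the three patterns are assumptions chosen for illustration, so adjust them to your own data.

```shell
# Hypothetical helper: back up a directory while skipping a few patterns.
backup_with_excludes() {
    # Patterns are quoted so that the shell passes them to rsync unexpanded.
    # A pattern ending with a slash only matches directories.
    rsync -a \
        --exclude='*.tmp' \
        --exclude='*~' \
        --exclude='.cache/' \
        "$1" "$2"
}

# Example call (paths are illustrative):
# backup_with_excludes /home/mydata/ /backups/data-20080810/
```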

Rsync return status

Like any program, rsync returns an integer when it exits. This return status can be used to check whether or not the transfer was successful. When rsync returns 0, the transfer was successful. Any other status means there was an error. However, you may run rsync on a live system. For instance, you can run rsync on a server every night to make a backup of its root file system. On a live system, files are created and removed all the time, so rsync will probably complain about files which have vanished during the transfer (like temporary files). For this reason, we may often consider the return statuses 23 and 24 as success on a live system:

  • 23: Partial transfer due to error
  • 24: Partial transfer due to vanished source files
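In a nightly backup script, this policy can be encoded in a small wrapper around rsync. The sketch below is a minimal illustration, not a definitive implementation: the function name rsync_live is hypothetical, and the set of tolerated statuses (24 for vanished source files and 23 for a partial transfer due to error, as listed in the rsync man page) is an assumption you should adapt to your own requirements.

```shell
# Hypothetical wrapper: run rsync and treat the statuses that are expected on
# a live system as success. Which statuses to tolerate is a policy decision.
rsync_live() {
    status=0
    rsync "$@" || status=$?
    case $status in
        0)  return 0 ;;
        24) echo "rsync: some source files vanished during the transfer" >&2
            return 0 ;;
        23) echo "rsync: partial transfer due to error, check the output" >&2
            return 0 ;;
        *)  echo "rsync failed with exit status $status" >&2
            return "$status" ;;
    esac
}

# Example call (paths are illustrative):
# rsync_live -a -x / /backups/rootfs/
```

Status 23 can also hide real problems such as unreadable files, so a stricter script might tolerate only 24 and report 23 as a failure.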

How to install and configure the rsync daemon

You will probably need to install an rsync daemon if you want to transfer data over the network. The daemon is the rsync process which is started in the background and which listens on a tcp port to handle incoming connections from rsync clients. The daemon can be installed either as a standalone service (use /etc/init.d/rsync to manage the service), or as a module of inetd/xinetd.

The configuration file for the daemon is stored in either /etc/rsyncd.conf or /etc/rsync/rsyncd.conf by default. This is the file you have to edit to change the settings of the daemon. When you change the configuration, don't forget to restart the rsync daemon (or just send it a HUP signal using kill -HUP <pid-of-rsync>). This is not necessary if you are using inetd/xinetd, since the configuration is read each time a client connects. In that case, check that the inetd/xinetd service is enabled, because it may be stopped.

When you install an rsync daemon, you must keep in mind that it may have to be secured. You can use the following techniques:

  • require a password to connect
  • allow only specific IP addresses to connect
  • give clients read-only access

If you protect the access with a password, it's recommended that the password used by the rsync client is written in a text file instead of being passed on the command line. This prevents other people who run ps -aux from seeing the password.
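On the client side, rsync can read the password from a file given with --password-file. The sketch below shows one way to create such a file so that only its owner can read it. The function name, the file location, the user name and the module are hypothetical. On the daemon side, the matching module would need the auth users and secrets file parameters of rsyncd.conf, which are not shown here.

```shell
# Hypothetical helper: write the password read from standard input into a
# file that only its owner can read.
write_password_file() {
    file=$1
    ( umask 077 && cat > "$file" )   # a new file is created with mode rw-------
    chmod 600 "$file"                # also tighten a file that already existed
}

# Example (user name, module and paths are placeholders):
# write_password_file /root/.rsync-pass     # type the password, then Ctrl-D
# rsync -a --password-file=/root/.rsync-pass /home/mydata/ backupuser@192.168.1.1::mybackups/data/
```

Reading the password from standard input keeps it out of the command line of any external process.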

Here is an example of a basic, secured configuration for an rsync daemon:

# ======================/etc/rsyncd.conf======================
pid file = /var/run/rsyncd.pid
read only = yes
uid = root
gid = root

[share1]
    path = /mnt/share1
    read only = yes
    hosts allow = 192.168.1.1, 10.88.45.0/24

[backups]
    path = /var/tmp/catalyst/tmp
    read only = no
    hosts allow = 192.168.1.1, 10.88.45.0/24

[rootfs]
    path = /
    read only = yes
    hosts allow = 192.168.1.1, 10.88.45.0/24

[upload]
    path = /upload
    read only = no
    hosts allow = 172.16.0.0/16

How to install rsync on windows with cygwin

You can also install an rsync daemon on Windows by installing the cygwin environment. Cygwin provides a Linux-compatible environment that runs on Windows. The programs are not emulated: the software we use on Linux has been compiled to run on Windows, so each program is a native Windows executable.

To install cygwin, you just have to run the setup.exe program provided on the official website, and install the packages you need. The installer downloads the binary packages and installs them on the hard drive. When the installation is done, you just have to click on the icon to open a window with a bash shell. You have nothing more to do to install the rsync client on Windows. Just keep in mind that the hard disks are seen as /cygdrive/c/, /cygdrive/d/, ... when working under cygwin.

You can also install rsync as a daemon on cygwin. To install services in cygwin, you can use a special program named cygrunsrv.exe. It installs a cygwin service as a normal windows service, so that it can be automatically started at windows boot time. That way you don't have to start the daemon by hand.

People have reported that rsync does not work well on top of ssh on cygwin, so you should really use the standalone daemon. Here is the command to install it as a service:

cygrunsrv.exe -I "cygrsyncd" -p /usr/bin/rsync.exe -a "--config=/etc/rsyncd.conf --daemon --no-detach"

Then start the new service (run services.msc and start the service named "cygrsyncd").
