Incremental and Static Backups with git and rsync

From FachschaftSprachwissenschaft
Revision as of 21:23, 9 July 2008 by DrNI (Talk | contribs)

This page will introduce you to one of the most underestimated necessities in the computing world: backups. It's really just the documentation of my own backup system, which I set up after a disk crash (from which I recovered by sheer luck, because the disk didn't fail completely and I had some fairly recent backups from a system upgrade).

Admin 03:11, 5 July 2008 (CEST)

Please feel free to add suggestions and additional methods to this page if they extend the methods presented here. If you have your own way of backing up files, then we might want to consider putting this under a different, more general title or even creating a collection of related pages.

Incremental and Static Backups with git and rsync

This article covers the usage of git and rsync with regard to maintaining a backup of your system. While it helps to be familiar with both tools, we will try to give a comprehensive explanation of the techniques you might want to use.

While rsync is generally the best method for backing up your static content, people disagree on the choice of VCS to use for revision control. There are also solutions that only use rsync's incremental backup capabilities. Feel free to add documentation and/or links to resources if you think they might be valuable to others.

What you need

You'll need a few things in place before you can start:

  • Some UNIX-like operating system. This guide does not cover Windows. (For Windows, there is a rather fancy rsync-based backup solution presented by the German computer magazine c't: rsyncbackup.vbs)
  • git, a cron daemon and rsync installed (packaged in virtually all distributions.)
  • A computer, obviously. I'm using a laptop for day-to-day commuting and work at the university, and a desktop computer at home as a server and workstation. My system has the positive side effect of keeping my laptop and desktop in sync, so I can seamlessly switch between working places.
  • Some external method of saving your backup. Backups are supposed to be on an entirely different machine, kept at a different place from your working environment. Different place, as in, another house, town, country, continent. Why? Fires, burglars, republicans and natural disasters. It'll happen. To you.

Different Data, Different Backups

Using two entirely different approaches for backup on the same system might seem odd at first. My choice follows from my "the right tool for the right job" credo. Some data I want to store not only once, but track as it changes. For other data that doesn't change frequently, this doesn't make sense. So what are the different possibilities we have here?

Types of data

You will have to think about three different categories your data can fall into:

  • Frequently changing data
  • Mostly static data
  • Data easily recoverable (like downloaded versions of programs or entire image files) and therefore not worth backing up

Dynamic Storage: Content Under Revision Control

Dynamic data is volatile content that changes frequently: the content you typically do your work on. Code, papers, web pages, homework, writings, personal data, etc. Because this content actually changes over time, it is worth keeping track of how it changes. And you know it: damn, that thing used to work only yesterday! Or: I know I had a nice formulation for this last Monday, why did I delete it? It happens!

Static Storage: Data

What happens to all your papers, binaries, downloaded files, music and image collections, videos, etc.? First, you have to decide what you actually want to keep track of. Downloaded data doesn't seem to be the right thing to clutter up your free space with, as you can fetch it off the Net anytime. Your image collection, on the other hand, is important, so you will want to back it up.

Layout of Your Home Directory

It is not a requirement, but it helps to have a clean home directory, where each folder holds one specific type of data. This keeps your setup simple, and it usually has the added benefit of helping you find your stuff more easily. Note that it's not tragic if data you don't want to keep track of makes it into your static backup folders. It is also not tragic if static data makes it into the dynamic folders. The other way round is not safe, though: important data that ends up in a folder that is not backed up is unprotected, so keep it where it belongs! My home directory is laid out the following way:

* bin   -- scripts and programs I wrote myself and put here so they are added to
           my $PATH
* dat   -- what I work on. Java exercises, Semantics homework, Jobs..
           code is kept in src/
- etc   -- various content
* html  -- my public_html folder, with a nicer name. Primarily for testing and
           writing web pages using my local apache server.
x mail  -- local mail repository. You won't have that if you don't use fetchmail
           or similar to organize your email locally
- music -- the obvious
x opt   -- optional installations that need to be in the
           user directory, like Eclipse
* src   -- most precious folder. All the source's in it.
x temp  -- temporary files
x var   -- volatile content like downloads I don't mind losing

*: git   -: rsync   x: not backed up

This is in no way how you should do it; it's only here so you can understand how the rest of the backup strategy works. You will want to come up with your own hierarchy, tailored to your needs.

Using git to Track Your Work

Revision control systems are typically aimed at source code, but git copes with binary files, too. This means you can also store OpenOffice documents or other binary formats you work in, although by default git cannot show meaningful diffs for them. git is also very fast and space/time efficient. A benefit that might not be immediately apparent is its distributed nature: you can keep local branches for different machines and manage a master content branch that is shared between your systems.

Single Branch Setup

Initializing the Repository

Let us begin with the easier task of creating our first branch. You will want to have only one branch if you work on your data only at one computer and back up to a non-interactive medium, like a NAS/external drive or a hosted backup solution.

To initialize your git environment, say

~$ git init

Then you need to do some setup. Add your name and email to git's config:

~$ git config --global user.name "Jane Doe"
~$ git config --global user.email "you@example.com"

The --global flag tells git to use these data for every git project you start or commit to. If you decide to track one of your projects with git and share it with others, they will know who you are and how to contact you. Besides, we're just making you familiar with git ;-)

Next, you need to tell git what to keep track of, and what to exclude. Edit the file ~/.gitignore:

## Everything you put in here, will be ignored. Globbing allowed, / is git's root, not your file system's root

# ignore folders with contents we do not care about. We don't care about stuff we want to use rsync to back up.

# A prefixed ! negates a pattern: matching files are tracked even if an earlier pattern ignored them:


# You will probably want to ignore most if not all of the dotfiles in your home. Most of them are just not worth keeping track of:

# But some files may contain valuable configuration!

# We do not want to keep track of keeping-track-of systems git doesn't know about!
# Note how those paths do not carry an absolute prefix. This means that they will be matched anywhere in the hierarchy

# Some editors make backups themselves. We don't want to back them up, too


# Other miscellaneous stuff:
# No temporary binary files
# No binary output from compilations

Note that the file uses shell globbing, not regular expressions!
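As an illustration of the kinds of patterns the comments above describe, here is a minimal sketch. Every pattern in it is an assumption based on the home directory layout shown earlier, not a recommended set; the script works in a scratch directory so it can be tried without touching your real home.

```shell
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
git init -q .

# Illustrative patterns only; adapt them to your own layout.
cat > .gitignore <<'EOF'
# folders handled by rsync or not backed up (leading / anchors at git's root)
/etc/
/music/
/mail/
/opt/
/temp/
/var/
# ignore dotfiles in the top-level directory ...
/.*
# ... but keep a few that hold valuable configuration
!/.gitignore
!/.bashrc
# metadata of other version control systems, matched anywhere
.svn/
CVS/
# editor backups
*~
*.bak
# binary output from compilations
*.o
*.class
EOF

# A few sample files to test the patterns against.
mkdir -p music src
touch music/song.ogg src/main.c src/main.o 'src/notes.txt~' .bashrc .viminfo

# Stage everything that is not ignored, then list what git would track.
git add -A .
tracked=$(git ls-files)
printf '%s\n' "$tracked"
```

Only `.bashrc`, `.gitignore` and `src/main.c` end up staged; everything else is filtered out by one of the patterns.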

This would be a great time to check the volume of the content we're going to add. I wouldn't recommend backing up more than 2 GB in a git repo. Now we need to add content to our repository.

~$ git add -v . > git-add.log

This will take a while. You can monitor git's progress with

~$ tail -f git-add.log

Watch the log and see if there's some content you don't want.

Now git knows that it has to keep track of those files you added. However, we still need to commit the changes to make it permanent. Just to be sure, check the size of your additions:

~$ du -chs $(cut -d \' -f 2 git-add.log)
[.. long list of files ..]
891M total

You probably shouldn't keep track of more than about 2 GB.
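The same check can also be made before anything is staged, by measuring the folders you intend to track. The following is a rough sketch under stated assumptions: the folder names and the sample file are stand-ins, and the 2 GB guideline is expressed here as 2 GiB in KiB.

```shell
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Stand-ins for the folders you intend to track (hypothetical names).
mkdir -p bin src
dd if=/dev/urandom of=src/data.bin bs=1024 count=3072 2>/dev/null  # 3 MiB of sample data

limit_kib=$((2 * 1024 * 1024))   # 2 GiB expressed in KiB

# du -sck prints one line per folder plus a final "total" line, all in KiB.
total_kib=$(du -sck bin src | tail -n 1 | cut -f 1)

if [ "$total_kib" -le "$limit_kib" ]; then
    verdict="ok"
else
    verdict="too large"
fi
echo "$total_kib KiB: $verdict"
```

In your own home you would replace `bin src` with the folders marked for git in your layout.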

If you're content, you can commit:

~$ git commit -a -m "Initial Commit"

If you don't specify a commit message with -m, git will open an editor so you can enter one. git will not let you commit without a message; it's very religious about this. The commit itself might take some time. Get yourself a cup of coffee and relax.

Backing up

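To back up, create another git repository, for example on a server or an external drive. Make it a bare repository (one without a working tree), since it only receives pushes:

$ git init --bare /media/external_drive/backups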

Then add this repository to git as a remote named 'backup': a location it knows and can put files on.

~$ git remote add backup /media/external_drive/backups

This adds a 'location' to git's memory. You can also specify a host using ssh:

~$ git remote add backup ssh://user@host/path/to/backups

Git is smart and will use ssh key authentication, so you can add this to a cron job if you load your key into an ssh-agent first.

Now use git push to easily put your home folder on another location:

~$ git push --all backup

The --all flag tells git to copy all branches.
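To see the whole round trip in one place, here is a minimal sketch that runs entirely inside a scratch directory. The paths standing in for the home directory and the external drive, the sample file, and the identity are all assumptions made for the demonstration.

```shell
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Stand-ins for $HOME and the external drive (both hypothetical paths).
home="$work/home"
drive="$work/external_drive/backups"
mkdir -p "$home/src" "$drive"

cd "$home"
git init -q .
git config user.name "Jane Doe"
git config user.email "you@example.com"
echo 'int main(void) { return 0; }' > src/main.c
git add .
git commit -q -m "Initial Commit"

# A bare repository has no working tree; it only receives pushes.
git init -q --bare "$drive"
git remote add backup "$drive"
git push -q --all backup

# Compare the commit in the working repository with the one in the backup.
branch=$(git symbolic-ref --short HEAD)
local_head=$(git rev-parse HEAD)
backup_head=$(git --git-dir="$drive" rev-parse "refs/heads/$branch")
echo "local:  $local_head"
echo "backup: $backup_head"
```

Both lines should show the same commit hash. For a backup drive that is always mounted, a crontab entry along the lines of `0 * * * * cd "$HOME" && git add -A && git commit -q -m "hourly snapshot" && git push -q --all backup` would repeat this every hour; treat it as a starting point rather than a finished setup.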

Multi-Branch Setup

TODO: write me!

Using git When You Need to Recover

So you've deleted an important file? No problem. cd to your home directory. If the file is in your last commit, you can just do a checkout:

~$ git checkout -- path/to/file

This will also work if you just messed up the file and want to start from scratch.

But it has been gone for a while? No problem, you can specify a commit to check out from:

~$ git checkout HEAD~3 -- path/to/file

This tells git to go back 3 commits and restore the file as it finds it there.
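Both recovery steps can be tried safely in a throwaway repository. The sketch below assumes a hypothetical file `paper.txt` that was changed in four successive commits.

```shell
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"
git init -q .
git config user.name "Jane Doe"
git config user.email "you@example.com"

# Four commits, each changing the same file.
for n in 1 2 3 4; do
    echo "draft $n" > paper.txt
    git add paper.txt
    git commit -q -m "draft $n"
done

# Accidentally deleted: restore the version from the last commit.
rm paper.txt
git checkout -- paper.txt
latest=$(cat paper.txt)

# Restore the version from three commits back.
git checkout HEAD~3 -- paper.txt
older=$(cat paper.txt)

echo "last commit: $latest"
echo "three back:  $older"
```

Note that checking out a file from an older commit also stages that old version, so a following commit records the restored content.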

You know you've put some files on your computer, but where did they end up? What are they called?

~$ git log --name-status --since="yesterday"
~$ git log --name-status --since="2 weeks ago"
~$ git log --name-status --since="1 hour ago"

There you go. You can also fire up gitk (make sure to install it first), a graphical tool to browse through your commit history!

Something most people forget about git: while it uses those funny SHA1 hashes to keep track of commits, you can often replace them with expressions referring to a point in time. These are resolved through the local reflog, so they only reach back as far as this repository's own recorded history. So, you can say something like

~$ git grep some_pattern "@{two weeks ago}" -- path/to/file # or 2 minutes ago, or one hour ago...
~$ git diff "@{2}" -- path/to/file # '@{2}' is where the current branch pointed two updates ago

Using rsync to Keep Your Data Safe

You can use one single command to do this! Give rsync the folders to back up, followed by the destination:

~$ rsync -vax --numeric-ids --bwlimit=800 --ignore-existing --size-only ~/etc ~/music /path/to/backup

Putting it Together

TODO: Write me! Meanwhile, see the Talk Page