Incremental and Static Backups with git and rsync

From FachschaftSprachwissenschaft
Revision as of 11:36, 30 July 2008

This is the documentation of my own backup system I installed after I had a disk crash (from which I recovered due to sheer luck, because the disk didn't crash fully and I had some fairly recent backups from a system upgrade).

Please feel free to add suggestions and additional methods to this page if they extend the methods presented here. If you have your own way of backing up files, describe it in the Backups section.

This article covers the usage of git and rsync with regard to maintaining a backup of your system. While it helps to be familiar with both tools, we will try to give a comprehensive explanation of the techniques you might want to use.

While rsync is generally the best method for backing up your static content, people disagree on the choice of VCS to use for revision control. There are also solutions that only use rsync's incremental backup capabilities. Feel free to add documentation and/or links to resources if you think they might be valuable to others.

If you are very interested in this topic, you can have a look at the resources of the vcs-home group and also subscribe to their mailing list.

What you need

You'll need a few things in place before you can start:

  • Some UNIX-like operating system. This guide does not cover Windows. (For Windows, there is a rather fancy rsync-based backup solution presented by the German computer magazine c't: rsyncbackup.vbs)
  • git, a cron daemon and rsync installed (all three are packaged in virtually every distribution).
  • A computer, obviously. I'm using a laptop for day-to-day commuting and work at the university, and a desktop computer at home as a server and workstation. My system has the positive side effect of keeping my laptop and desktop in sync, so I can seamlessly switch workplaces.
  • Some external method of saving your backup. Backups are supposed to be on an entirely different machine, kept at a different place from your working environment. Different place, as in, another house, town, country, continent. Why? Fires, burglars, republicans and natural disasters. It'll happen. To you.

Different Data, Different Backups

Using two entirely different approaches for backup on the same system might seem odd at first. I chose to do this because I believe in using the right tool for the right job. Some data I want to store not just once, but track as it changes. For other data that rarely changes, this doesn't make sense. So what are the different possibilities we have here?

Types of data

You will have to think about three different categories your data can fall into:

  • Frequently changing data
  • Mostly static data
  • Data easily recoverable (like downloaded versions of programs or entire image files) and therefore not worth backing up

Dynamic Storage: Content Under Revision Control

Dynamic data is volatile content that changes frequently: the content you typically do your work on. Code, papers, web pages, homework, writings, personal data, etc. Because this content changes over time, it is worth keeping track of how it changes. And you know the feeling: damn, that thing was still working yesterday! Or: I know I had a nice formulation for this last Monday, why did I delete it? It happens!

Static Storage: Data

What happens to all your papers, binaries, downloaded files, music and image collections, videos, etc.? First, you have to decide what you actually want to keep. Downloaded data isn't worth cluttering up your free space with, as you can fetch it off the Net anytime. Your image collection, on the other hand, is important, so you will want to back it up.


Layout of Your Home Directory

It is not a requirement, but it helps to have a clean home directory in which each folder holds one specific type of data. This keeps your setup simple, and it usually has the added benefit of helping you find your stuff more easily. Note that it's not tragic if data you don't want to keep track of makes it into your static backup folders. It is also not tragic if static data makes it into the dynamic folders. The reverse is not safe, though: important or frequently changing data that ends up in a static or untracked folder loses its history or is not backed up at all, so keep it where it belongs! My home directory is laid out the following way:

~/:
* bin   -- scripts and programs I wrote myself and put here so they are added to
           my $PATH
* dat   -- what I work on. Java exercises, Semantics homework, Jobs..
           code is kept in src/
- etc   -- various content
* html  -- my public_html folder, with a nicer name. Primarily for testing and
           writing web pages using my local apache server.
x mail  -- local mail repository. You won't have that if you don't use fetchmail
           or similar to organize your email locally
- music -- the obvious
x opt   -- optional installations that need to be in the
           user directory, like Eclipse
* src   -- most precious folder. All the source is in it.
x temp  -- temporary files
x var   -- volatile content like downloads I don't mind
           losing

The symbols in front of the names stand for the backup strategy used for each folder:

  • * : git
  • - : rsync
  • x : not backed up

This is in no way how you should do it; it's only here so you can understand how the rest of the backup strategy works. You will want to come up with your own hierarchy, tailored to your needs.
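If you script your backups later on, it can help to write this mapping down once. The following is only a sketch of that idea: the function name backup_strategy is made up, the folder names are the ones from the example layout, and dat is assumed to be git-tracked.

```shell
#!/bin/sh
# Map a top-level folder of the example layout to its backup strategy.
# backup_strategy is a hypothetical helper; adapt the names to your own layout.
backup_strategy() {
    case "$1" in
        bin|dat|html|src) echo git ;;    # marked * : tracked with git (dat is an assumption)
        etc|music)        echo rsync ;;  # marked - : copied with rsync
        *)                echo none ;;   # marked x, or unknown: not backed up
    esac
}

for dir in src music temp; do
    printf '%s -> %s\n' "$dir" "$(backup_strategy "$dir")"
done
```

Keeping the decision in one place means a cron script can ask for the strategy instead of hard-coding folder lists in several spots.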

Using git to Track Your Work

Revision control systems are typically aimed at source code, but git copes well with binary files, too. This means you can also store OpenOffice documents or other binary formats you work in, although git cannot show meaningful differences between two versions of them. git is also very fast and space-efficient. Another benefit that might not be immediately apparent is its distributed nature: you can keep local branches for different machines and manage a master content branch that is shared between your systems.

Single Branch Setup

You'll prefer a single branch setup if you only work on one machine and thus don't need to keep track of different configuration files and/or different content.

Initializing the Repository

Let us begin with the easier task of creating our first branch. You will want to have only one branch if you work on your data only at one computer and back up to a non-interactive medium, like a NAS/external drive or a hosted backup solution.

To initialize your git environment, change to your home directory and say

cd ~
git init

Then you need to do some setup. Add your name and email to git's config:

git config --global user.name "Jane Doe"
git config --global user.email jane@example.com

The --global flag tells git to use these data for every git project you start or commit to. If you decide to track one of your projects with git and share it with others, they will know who you are and how to contact you. Besides, we're just making you familiar with git ;-)

Next, you need to tell git what to keep track of, and what to exclude. Edit the file ~/.gitignore:

# Everything you put in here will be ignored. Globbing is allowed; / is
# git's root, not your file system's root.

# ignore folders with contents we do not care about. We don't care about stuff
# we want to use rsync to back up.
# (/var/* rather than /var, so that single entries can be re-included below.)
/var/*
/music
/mail
/opt
/etc
/Desktop
/temp
/dat/uni/binary_files

# A prefixed ! re-includes a path that an earlier pattern ignored. This only
# works if the parent directory itself is not ignored:
!/var/something_important

# You will probably want to ignore most if not all of the dotfiles in your
# home. Most of them are just not worth keeping track of:
/.*

# But some files may contain valuable configuration!
!/.vimrc
!/.vim
!/.zshrc
!/.muttrc
!/.Xdefaults
!/.emacs

# We do not want to keep track of keeping-track-of systems git doesn't know
# about!  Note how those paths do not carry an absolute prefix. This means that
# they will be matched anywhere in the hierarchy
.svn
CVS

# Some editors make backups themselves. We don't want to back them up, too

#Emacs:
.#* 
*~
#Vim:
.*.swp
.*.swo

# Other miscellaneous stuff:
# No temporary binary files
a.out
# No binary output from compilations
*.class
*.so
*.o

Note that the file uses shell globbing, not regular expressions! If you want to understand the format more thoroughly, please read the gitignore man page (type man gitignore).
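To see how ignore and re-include patterns interact, you can try them in a throwaway repository first. This is a minimal sketch, not part of the original setup: the file names are invented, and everything happens in a temporary directory. One detail worth knowing is that git does not re-include a file whose parent directory is itself ignored, so the sketch ignores /var/* rather than /var.

```shell
#!/bin/sh
set -eu
# Throwaway repository, so nothing in your real home directory is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Invented demo files.
mkdir var
echo "scratch" > var/junk
echo "keep me" > var/something_important
echo "object"  > build.o
echo "notes"   > notes.txt

# Ignore the contents of var/ (not var/ itself), re-include one entry,
# and ignore object files anywhere in the tree.
cat > .gitignore <<'EOF'
/var/*
!/var/something_important
*.o
EOF

git add .
tracked=$(git ls-files)   # everything git agreed to track
echo "$tracked"
```

Only .gitignore, notes.txt and var/something_important end up in the list; var/junk and build.o are skipped.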

This would be a great time to check the volume of the content we're going to add. To do this, we need to add content to our repository.

git add -v . > git-add.log

This will take a while. You can monitor git's progress with

tail -f git-add.log

Watch the log and see if there's some content you don't want.

Now git knows that it has to keep track of those files you added. However, we still need to commit the changes to make it permanent. Just to be sure, check the size of your additions:

# needs GNU du; NUL-separated names keep paths with spaces intact
cut -d \' -f 2 git-add.log | tr '\n' '\0' | du -chs --files0-from=-
[.. long list of files ..]
891M total

You probably shouldn't keep track of more than about 2 Gigs.
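If the total turns out too large, it helps to know which files are responsible. The following sketch lists the staged files by size, largest first. It runs in a temporary repository with invented file names, and it assumes file names without newlines or other characters that git would quote.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Three demo files of different sizes (names are made up for the example).
printf '%1000s' '' > big.bin
printf '%100s'  '' > medium.txt
printf '%10s'   '' > small.txt
git add .

# Print "size<TAB>path" for every staged file, largest first.
largest=$(git ls-files | while IFS= read -r f; do
    printf '%d\t%s\n' "$(wc -c < "$f")" "$f"
done | sort -rn | head -n 10)
echo "$largest"
```

Anything near the top of this list that you don't need under revision control is a candidate for another line in ~/.gitignore, followed by git rm --cached on that path.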

If you're content, you can commit:

git commit -a -m "Initial Commit"

If you don't specify a commit message with -m, git will open an editor for you to enter one. git will not let you commit without a message; it's very religious about this. The commit itself might take some time. Get yourself a cup of coffee and relax.

Congratulations! You now have a complete copy of your home directory's most important files inside a git repository.

Subsequent Commits

Whenever you want to record new or changed files, type

git add -v .
git commit

This first adds all untracked files and all tracked but changed files to the index, git's staging area for the next commit (the -v flag tells git to echo the files it's adding), and then commits the contents of the index to the repository's history. Note that in git adding and committing are two different concepts. Just because you've added some files to the index doesn't mean git has already recorded them. It just knows that you want them in the next commit, but nothing is stored permanently until you really commit (you'll learn to love this behaviour once you start using git for your own projects).
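For unattended use, for example from a cron job, the two steps can be wrapped so that a commit is only created when something actually changed. This is a sketch under assumptions: the function name snapshot and the commit message format are invented, and the demo runs in a temporary repository rather than your home directory.

```shell
#!/bin/sh
set -eu
# snapshot: stage everything and commit only if the index differs from HEAD.
# (Hypothetical helper, not part of git.)
snapshot() {
    git add .
    # `git diff --cached --quiet` exits 0 when nothing is staged.
    if git diff --cached --quiet; then
        echo "nothing to commit"
    else
        git commit -q -m "Automatic snapshot $(date '+%Y-%m-%d %H:%M')"
    fi
}

# Demo in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "first draft" > paper.txt
snapshot                        # creates a commit
snapshot                        # nothing changed, so no commit
echo "second draft" > paper.txt
snapshot                        # creates another commit

commits=$(git rev-list --count HEAD)
echo "commits: $commits"
```

Checking the index before committing avoids an error exit from git commit when there is nothing to record, which matters if the script runs with set -e.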

Multi-Branch Setup

TODO: write me!

You'll only need multi-branch setups if you have more than one machine you're working on. Unfortunately, I'm not yet content with my own setup, thus I can't really write it up... Hopefully I'll do so at a later time ;-) If you have ideas about it, please add them here.

Using git When You Need to Recover

So you've deleted an important file? No problem. If the file is in your last commit, you can just check it out again:

git checkout HEAD -- path/to/file

This will also work if you just messed up the file and want to go back to the last committed version.

But it has been gone for a while? No problem, you can specify a commit to check out from:

git checkout HEAD~3 -- path/to/file

This tells git to go back 3 commits and restore the file as it finds it there. Or you can give it a timespec, which git resolves through the local reflog, so it only reaches back as far as this repository has recorded its own history:

git checkout "@{two days ago}" -- path/to/file
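The recovery commands are easy to rehearse before you need them. The following sketch builds a temporary repository with two versions of an invented file, deletes the file and restores first the latest and then the earlier version.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

# Two commits, so there is some history to go back to.
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first version"
echo "version 2" > notes.txt
git commit -q -a -m "second version"

rm notes.txt                        # the accident
git checkout HEAD -- notes.txt      # restore from the last commit
latest=$(cat notes.txt)

git checkout HEAD~1 -- notes.txt    # restore the version one commit earlier
older=$(cat notes.txt)

echo "$latest / $older"
```

Note that checking out an older version overwrites the file in your working directory, so commit or copy anything you still need first.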

You know you've put some files on your computer, but where did they end up? What are they called?

git log --name-status --since="yesterday"
git log --name-status --since="2 weeks ago"
git log --name-status --since="1 hour ago"

There you go. You can also fire up gitk (make sure to install it first) - a graphical tool to browse through your commit history!

You can also use git grep to search for a regular expression in your documents' contents:

git grep some_pattern

Or maybe you want to know whether a particular file contained something two days ago? The revision goes before the path:

git grep some_pattern "@{two days ago}" -- path/to/file
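Searching an older revision works the same way with any commit name, not just a timespec. A small sketch in a temporary repository, with invented file contents, that looks for text which has since been removed:

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

echo "TODO: call Alice" > notes.txt
git add notes.txt
git commit -q -m "add reminder"
echo "nothing left to do" > notes.txt
git commit -q -a -m "clear reminder"

# The pattern is gone from the current version ...
if git grep -q "Alice" -- notes.txt; then current=present; else current=absent; fi

# ... but still there one commit back. Revision first, then "--", then the path.
old_hit=$(git grep -n "Alice" HEAD~1 -- notes.txt)

echo "now: $current"
echo "before: $old_hit"
```

Matches from a revision are prefixed with the revision name, which makes it obvious where a hit came from.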

Backing Up

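To back up, create another git repository, for example on a server or an external drive. Make it a bare repository (one without a working copy), since that is the kind git expects to push to:

git init --bare /media/external_drive/backups

Then register this repository with your home repository as a remote named 'backup' - a location git knows and can put files on.

git remote add backup /media/external_drive/backups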

This adds a 'location' to git's memory. You can also specify a host using ssh:

git remote add backup ssh://user@host/path/to/backups

git then talks to the host over ssh and can use key authentication, so you can put the backup into a cron job if you load your keys into an agent first.

Now use git push to easily put your home folder on another location:

git push --all backup

The --all flag tells git to copy all branches. You don't need it for a single-branch setup but you will want to use it for multiple branches.
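The whole backup round trip can be tried locally before you point it at a real drive or server. This sketch uses two temporary directories in place of your home directory and the backup medium, and it assumes a bare repository on the receiving side.

```shell
#!/bin/sh
set -eu
work=$(mktemp -d)      # stands in for your home directory
backup=$(mktemp -d)    # stands in for the external drive or server

# The receiving side: a bare repository without a working copy.
git init -q --bare "$backup"

cd "$work"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com
echo "thesis" > thesis.txt
git add .
git commit -q -m "Initial Commit"

# Register the backup location and push every branch to it.
git remote add backup "$backup"
git push -q --all backup

# Both repositories should now point at the same commit.
branch=$(git symbolic-ref --short HEAD)
local_head=$(git rev-parse HEAD)
backup_head=$(git --git-dir="$backup" rev-parse "refs/heads/$branch")
echo "local:  $local_head"
echo "backup: $backup_head"
```

Comparing the two commit ids is a cheap way for a cron script to confirm that a push really arrived.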

Housekeeping

This section introduces some advanced git hackery. Generally you shouldn't need to refer to it often; it's here as a reference in case something goes wrong.

Compressing Your History

git is tailored towards efficiency and speed. Most of the usual actions therefore just add new objects to the repository instead of restructuring it. Thus the repository accumulates cruft after a while and you need to compress it. For this, you can use

git gc

This can take quite some time and resources, so you might want to put it into a separate cron job. Note that git also runs git gc --auto by itself after some everyday commands, such as commits and merges, and repacks once enough loose objects have piled up. So in normal use you shouldn't need to run it manually :-). Sometimes you will want to use one of git gc's options for some more arcane work.
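To see what compression does, you can count the loose (not yet packed) objects before and after. This is a sketch in a temporary repository; the helper name loose_objects is invented, and the numbers in a real home directory will of course be much larger.

```shell
#!/bin/sh
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Jane Doe"
git config user.email jane@example.com

# A few small commits, each of which leaves loose objects behind.
for n in 1 2 3; do
    echo "revision $n" > notes.txt
    git add notes.txt
    git commit -q -m "revision $n"
done

# Hypothetical helper: extract the "count:" line of git count-objects -v.
loose_objects() { git count-objects -v | sed -n 's/^count: //p'; }

before=$(loose_objects)
git gc -q
after=$(loose_objects)
echo "loose objects before gc: $before, after gc: $after"
```

In this tiny example all objects are reachable, so they should all move into a pack file and the loose count should drop to zero.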

Pruning Your History

Sometimes it just may happen that you (or your automatic scripts) committed some big files you don't really want to track. This may badly affect your repository's size and the speed of operations running on it. Deleting files from the history is rather advanced git sorcery and is usually not needed in the environments git was designed for: source code revision control. Please avoid using this technique except as a very last resort. And please do read the git-filter-branch man page before using it.

git filter-branch --tree-filter 'rm -rf path/to/files' HEAD

This can take a considerable amount of time to complete, depending on the size of your repository and history. Be sure to run git gc before running this.

But you will probably notice that your repository (located in ~/.git) is still the same size! This is because git has been designed by paranoid people (which is good!) who don't want you to lose any data. After you've rewritten your whole history, git still keeps the old commits reachable through its reflog, by default for another 30 days, after which they become subject to removal by git gc. git filter-branch also keeps a backup of the original refs under refs/original/.

If you want to, you can delete them right away by issuing

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --all --expire-unreachable=now
git gc --prune=now

Note that this will also throw away every other commit that is no longer reachable from a branch (for example, the ones replaced by git commit --amend).

Using rsync to Keep Your Data Safe

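You can use one single command to do this! Give rsync the folder you want to back up, followed by the backup location:

rsync -vax --numeric-ids --bwlimit=800 --ignore-existing --size-only /path/to/source/ /path/to/backup

Here -a preserves permissions, times and symlinks, -x stays on one file system, --bwlimit=800 caps the transfer rate, and --ignore-existing skips every file that is already present in the backup, which suits data that never changes. /path/to/source/ is a placeholder for your own static folder. You might want to look at the full list of options.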

Putting it Together

TODO: Write me! Meanwhile, see the Talk Page