Backups are a process that everyone -- everyone! -- needs to consider. This article looks at some on-line backup options for Linux that apply across the spectrum from home to enterprise-class users.
Last week’s column featured a discussion of using Linux software RAID (md) to provide data protection in addition to backups. Md can be used for RAID-1 (mirroring), RAID-5, or RAID-6 to give extra protection in the event of the loss of a drive. But the primary data protection scheme in almost all situations is a backup. This is true from home systems and laptops up to very large server farms (there are some questions about doing backups on very large storage systems, but that’s a subject for another day).
This article focuses a bit more on backups for home users and Small and Medium Businesses (SMBs). In particular, it looks at the concept of on-line backups: that is, backups to a storage farm on the Internet. However, the discussion is not limited to home users and SMBs; it can be applied to almost any company or storage need.
Taxonomy of Backups
It is well beyond the scope of this article to discuss the fundamentals of system backups, but for the sake of consistency, let’s do a quick review. A backup is a simple concept: a copy of data that can be used either to restore a complete file system in the event of a crash, or to restore individual files that have been erased (accidentally or otherwise). The complexity behind backups comes from the details of how this process occurs and how it is managed.
While Wikipedia is not the most authoritative source of information, its backup article explains the basic concepts very well and serves as a good starting point for explaining the details of backups, including management. It breaks backups down into an easily understood hierarchy:
- It starts by discussing “Data Repository Models,” which describe “how” the data is stored. For example:
- Full + incremental (arguably the most common)
- Continuous data protection
- A discussion of “Storage Media” for backups
- Magnetic tape
- Floppy drives (for the older crowd)
- Hard drives
- Optical drives
- Solid-state drives
- Remote (on-line) storage
- Techniques for managing the data repository
- Off-site, off-line vault
- Backup site or disaster recovery (DR) site
- Then there is a discussion about selecting and manipulating data that is to be backed up including how to handle live data and metadata
- Copying files
- Partial files
- Filesystem dump
- Identification of changes
- Versioning file system
- There is a discussion about the manipulation of data and dataset optimization (this is where the concept of de-duplication comes in)
- Finally there is a section that talks about managing the backup process. It discusses such topics as:
- Recovery Point Objective (RPO)
- Recovery Time Objective (RTO)
- Data security
- Limitations (what limits the overall process):
- Backup window
- Performance impact
- Costs of hardware, software, labor
- Network Bandwidth
- Chain of trust
- Measuring the process (making sure the backup scheme worked)
- Backup validation
- Monitored Backup
The discussion is very thorough and logical. It is well worth reading even if you are doing backups today.
This article focuses on the storage media portion of the above taxonomy, in particular on-line (remote) backups.
On-line Storage Concepts
The idea of storing backups on-line is really not a difficult one. The fundamental idea is to store data, either full backups, or incremental data, on a remote storage server that is located somewhere on the Internet. You can then use this stored data in whatever manner you see fit. For example, it could be used for restoring systems in the event of a storage failure, or restoring individual files that have been erased. As long as the system requiring the data has access to the Internet, the data can be retrieved.
One additional concept for on-line backups is sharing files. One user can copy their files to the on-line backup and then other users, sharing the same backup, can download the files as needed. You can also extend this concept to something called synchronization. That is, making sure you have the latest files on several of your systems (i.e. the same data is on all of the systems). For example, you can use an on-line backup to upload your important files from your desktop and then download them to your laptop when you are on the road.
Since storage is not free, on-line backups are commercial enterprises with varying cost structures. There are really two main types of backup services:
- Services that use standard Linux tools such as rsync
- Services that use a custom tool, sometimes web-based
Which approach is better depends upon you, your preferences, your processes, and the cost structure you are comfortable with.
On-line Backup Services Using Standard Linux Tools
The most common tool used for this type of on-line backup is rsync. Rsync is a file copying tool that can be used on local hosts or across remote hosts using a remote shell or an rsync daemon. A key feature of rsync is that it can send just the differences between two files rather than the entire file, saving bandwidth and time. This makes rsync a very popular tool for backups.
One available service that uses rsync is “rsync to Amazon’s S3” by S3rsync.com. Amazon has developed a “cloud” storage product called Amazon S3 (Simple Storage Service). S3 is used by a number of companies because of its relatively low cost and because it can be expanded at any time without having to buy hardware. The cost structure of S3 is based on three aspects:
- Capacity (how much are you storing)
- Data transfer (how much data is transferred in and out)
- Requests (getting and putting data)
The rates are also based on monthly usage patterns.
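To see how the three components combine into a monthly bill, here is a back-of-the-envelope sketch; note that the rates below are made-up placeholders, not Amazon’s actual prices (check the S3 pricing page for current rates):

```shell
# Hypothetical rates for illustration only -- not Amazon's prices.
STORED_GB=100        # capacity stored for the month
TRANSFER_GB=20       # data transferred in and out
REQUESTS=50000       # get/put requests

awk -v s="$STORED_GB" -v t="$TRANSFER_GB" -v r="$REQUESTS" \
    'BEGIN {
        # placeholder rates: $/GB-month, $/GB, $/1000 requests
        cost = s*0.15 + t*0.10 + (r/1000)*0.01
        printf "estimated monthly cost: $%.2f\n", cost
     }'
```

The point of the sketch is simply that all three usage dimensions contribute, so a workload with many small requests can cost more than its raw capacity suggests.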
While S3 sounds great, there are a couple of caveats to its use. First, S3 makes no guarantee against losing data, although given the size and scope of S3, data loss is unlikely. In addition, to access S3 you need to use Amazon’s tools, so it is not easily integrated into existing processes and tools. Moreover, the S3 storage protocol only lets you “get” or “put” data, so you cannot modify a file in place once it is stored in S3. If you change even one bit in a file, you have to download the entire file, modify it, and upload the entire file again, incurring extra cost and burning bandwidth between S3 and your servers.
S3rsync.com has developed a simple way to integrate rsync with S3. The basic concept is that you rsync to S3rsync.com’s servers, which then use S3 for their storage. What S3rsync.com has done is put its rsync servers inside Amazon’s EC2 facility to allow easy access to S3. All you need is rsync (which comes with every Linux distribution) and ssh. Note that there is currently a 10GB limit on buckets (see the S3 site for more information on buckets).
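The usage pattern looks something like the sketch below. The hostname, account name, and paths are invented for illustration (S3rsync.com provides the actual endpoint and credentials when you sign up), so the command is echoed rather than run:

```shell
# Hypothetical endpoint and paths -- see S3rsync.com for real values.
REMOTE="myaccount@rsync.s3rsync-example.com"
SRC="$HOME/documents/"
DEST="$REMOTE:mybucket/documents/"

# -z compresses in transit; -e ssh runs rsync over an ssh tunnel.
CMD="rsync -avz -e ssh $SRC $DEST"
echo "$CMD"
```

From the client’s point of view this is an ordinary rsync-over-ssh transfer; the S3 storage behind it is invisible.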
As of the writing of this article, S3rsync.com charges $0.05 per hour in addition to the standard S3 charges.
Another on-line backup service that uses rsync is rsync.net. It is a very popular service, primarily because of its support (i.e. engineers man the phones for the most part). In addition to rsync itself, rsync.net currently supports:
- Certain commands run over ssh
- Mounting the file system using sshfs
- An encryption tool called duplicity that uses rsync but also performs GPG encryption in the process
- Remote subversion repositories
- Remote data dumps of MySQL and PostgreSQL databases
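For example, mounting the remote file system with sshfs might look like the sketch below. The account hostname here is an assumption (rsync.net assigns the real one), so the mount and unmount commands are shown commented out:

```shell
# Hypothetical account host -- rsync.net assigns the real one.
ACCOUNT="user1234@usw-s001.rsync.net"
MOUNTPOINT="$HOME/rsyncnet"

mkdir -p "$MOUNTPOINT"

# sshfs "$ACCOUNT:" "$MOUNTPOINT"     # mount the remote filesystem
# cp ~/important-file "$MOUNTPOINT"/  # then use ordinary tools on it
# fusermount -u "$MOUNTPOINT"         # unmount when finished

echo "ready to mount $ACCOUNT at $MOUNTPOINT"
```

The appeal of sshfs here is that once mounted, the remote storage behaves like a local directory, so existing scripts and tools need no changes.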
Rsync.net also does not have a traffic charge so you can move as much data into and out of the file system as you need. Also, rsync.net allows multiple users to share one account (great for work groups and/or file synchronization).
Pricing for rsync.net as of this writing starts at $1.20/GB per month up to about 50GB ($60/month). Above 50GB there is a discount. If you want redundant copies of the data in two different locations, the price is $2.10/GB per month for up to 50GB ($110/month).
On-line Backup Services Using Custom Tools
The second class of backup services uses a custom tool or is web-based. Some of them also take a different approach in that the files to be backed up have to be in the same directory (an approach that works better for work groups and file synchronization). A key feature of this class is that many of them offer free backups up to a certain limit (not a bad way to test a backup service).