Ceph: The Distributed File System Creature from the Object Lagoon

Did you ever see one of those terrible Sci-Fi movies involving a killer Octopus? Ceph, while named after just such an animal, is not a creature about to eat an unlucky Spring Breaker, but a new parallel distributed file system. The client portion of Ceph just went into the 2.6.34 kernel so let's learn a bit more about it.

The last two years have seen a large number of file systems added to the kernel with many of them maturing to the point where they are useful, reliable, and in production in some cases. In the run up to the 2.6.34 kernel, Linus recently added the Ceph client. What is unique about Ceph is that it is a distributed parallel file system promising scalability and performance, something that NFS lacks.

High-level view of Ceph

One might ask about the origin of the name Ceph since it is somewhat unusual. Ceph is short for Cephalopod, the class of molluscs to which the octopus belongs. So it’s really short for octopus, sort of. If you want more detail, take a look at the Wikipedia article about Ceph. Now that the name has been partially explained, let’s look at the file system.

Ceph was started by Sage Weil for his PhD dissertation at the University of California, Santa Cruz in the Storage Systems Research Center in the Jack Baskin School of Engineering. The lab is funded by the DOE/NNSA involving LLNL (Lawrence Livermore National Labs), LANL (Los Alamos National Labs), and Sandia National Laboratories. He graduated in the fall of 2007 and has kept developing Ceph. As mentioned previously, his efforts have been rewarded with the integration of the Ceph client into the upcoming 2.6.34 kernel.

The design goals of Ceph are to create a POSIX file system (or close to POSIX) that is scalable, reliable, and has very good performance. To reach these goals Ceph has the following major features:

  • It is object-based
  • It decouples metadata and data (many parallel file systems do this as well)
  • It uses a dynamic distributed metadata approach

These three features and how they are implemented are at the core of Ceph (more on that in the next section).

However, probably the most fundamental assumption in the design of Ceph is that large-scale storage systems are dynamic and that failures are guaranteed. The first part of the assumption, that storage systems are dynamic, means that storage hardware is added and removed and that the workloads on the system change. The second part presumes that there will be hardware failures, so the file system needs to be adaptable and resilient.

More in Depth

With the general view of Ceph in mind, let’s dive down into some more details to understand how it’s implemented and what it means. Below in Figure 1 is an overview of the layout of Ceph.

Figure 1: System layout of Ceph.

There are client nodes (the happy smiling faces), a metadata cluster, and the object storage cluster where the data is stored. When a client wants to open a file, it contacts the metadata cluster, referred to as the MDS, or MetaData Server, even though it is in fact a cluster. The MDS returns information that tells the client what its capabilities are (what it can and cannot do), the file size, striping information (the data is striped across multiple storage devices for performance reasons), and something called a file inode (used by Ceph). Once this information is received, the client sends data to and receives data from the Object Storage Devices (OSD’s) directly; together these make up the data storage cluster. During the data transactions the MDS is checked to see if there are any changes. If there are none, then everything proceeds normally. If there are changes, the MDS notifies the client and the OSD’s. Once everything is done and the close request is sent to the MDS and OSD’s, the client updates the MDS with any details and the MDS marks the file as closed and updates the metadata information.

Object-Based Storage
The system layout serves as a guide for further discussing the details and features of Ceph. One of the first features to explain is the object-based approach of the file system. In an object-based file system, the data is broken into objects that are assigned an object ID number and a small amount of metadata and then sent for storage on the Object Storage Devices (OSD’s). The file system metadata for that file then consists of a number of object ID’s that define all of the data, as well as other information about the file (e.g. access/modify dates, etc.). Typically, the metadata does not know precisely where the file is located and relies on the OSD’s for the storage and retrieval of the actual data. The OSD takes care of the lower-level functions itself (kind of a “smart” hard drive if you will). The file system interacts with the OSD’s at a high level, requesting the object itself or information about the object rather than asking for a range of inodes or blocks or something similar.
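
To make the idea concrete, here is a minimal sketch of cutting a file into objects. The object size, the StoredObject type, and the "<inode>.<index>" naming scheme are assumptions made for illustration; they are not taken from the Ceph source.

```python
from dataclasses import dataclass

OBJECT_SIZE = 4  # bytes per object; tiny on purpose so the example is readable


@dataclass(frozen=True)
class StoredObject:
    object_id: str  # hypothetical naming scheme: "<inode>.<index>"
    data: bytes


def split_into_objects(inode: int, payload: bytes, object_size: int = OBJECT_SIZE):
    """Cut a file's bytes into fixed-size objects, each with its own ID."""
    return [
        StoredObject(f"{inode:x}.{index:08x}", payload[offset:offset + object_size])
        for index, offset in enumerate(range(0, len(payload), object_size))
    ]


objects = split_into_objects(inode=0x1234, payload=b"hello ceph!")
# The file-level metadata only needs the list of object IDs, not block addresses.
object_ids = [obj.object_id for obj in objects]
```

The point of the sketch is that the file's metadata ends up as a short list of IDs, while where each object physically lives is somebody else's problem.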

While there have only been experimental OSD drives, the typical way of creating an OSD is to use a middle layer of software between the object-based file system and the file system on the drive itself (or even the drive itself). In this approach the drive is just a regular hard drive such as those we currently use. Typically the OSD middle layer converts the object request into a file system request on the underlying drive.
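
A toy version of such a middle layer might look like the following. The FileBackedOSD class and its put/get/stat methods are hypothetical names for this sketch, and storing one object per file is an assumption, not a description of how Ceph lays out data on disk.

```python
import os
import tempfile


class FileBackedOSD:
    """Toy OSD: turns object requests into plain file operations."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, object_id: str) -> str:
        # One object becomes one file on the underlying file system.
        return os.path.join(self.root, object_id)

    def put(self, object_id: str, data: bytes) -> None:
        with open(self._path(object_id), "wb") as handle:
            handle.write(data)

    def get(self, object_id: str) -> bytes:
        with open(self._path(object_id), "rb") as handle:
            return handle.read()

    def stat(self, object_id: str) -> int:
        """Return the object's size in bytes."""
        return os.path.getsize(self._path(object_id))


osd = FileBackedOSD(tempfile.mkdtemp(prefix="toy-osd-"))
osd.put("1234.00000000", b"hello")
```

The caller only ever names objects; blocks, inodes, and directories on the drive stay hidden behind the OSD.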

Initially Ceph used something called EBOFS (Extent and B-tree based Object File System) but support was dropped in mid-2009. It was replaced with btrfs which promises to give as good or better performance than EBOFS. In addition, btrfs has a few features that EBOFS does not. Namely,

  • Copy-on-write semantics for file data (who doesn’t like a COW?)
  • Well maintained and tested (it’s in the kernel and under heavy development)

In addition, according to the Ceph wiki,

"... To avoid reinventing the wheel, Ceph will use btrfs on individual storage nodes (OSDs) to store object data, and we will focus on adding any additional functionality needed to btrfs where it will hopefully benefit non-Ceph users as well. ..."

For example, there is a recent patch that adds some features to btrfs that help Ceph.

Distributed Metadata
Another key aspect of Ceph that distinguishes it from other file systems is that it uses something Sage terms “Dynamic Distributed Metadata Management.” The first keyword is distributed, meaning multiple metadata servers, unlike Lustre which only has one metadata server. Being distributed means that the loss of a metadata server (MDS) won’t cause the entire file system to crash.

The second keyword in the title is Dynamic. This means that the metadata can actually be moved or redistributed from one MDS to another. If an MDS goes down or is added, portions of the file system directory hierarchy are moved to better balance performance and capacity. This distribution is based on the workload but preserves locality in each MDS’s workload, improving performance because the metadata can be aggressively prefetched.

Dynamic metadata also means that over time the metadata is redistributed to make better use of resources including load balancing for systems that don’t even add storage hardware. So if a certain part of the directory tree was used more often than others, it can either be divided across MDS nodes or consolidated to a single MDS coupled with aggressive caching.
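
The following sketch illustrates the general idea of moving a busy directory subtree to a less loaded MDS. It is a toy model only: the subtree table, the request counter, and the "move the hottest subtree to the idlest server" rule are all assumptions for illustration and are not Ceph's actual balancing algorithm.

```python
from collections import Counter

# Which MDS is authoritative for which subtree (hypothetical names).
subtree_owner = {"/": "mds0", "/home": "mds0", "/scratch": "mds1"}
requests = Counter()  # requests seen per subtree


def owner_of(path: str) -> str:
    """Find the MDS for a path by longest matching subtree prefix."""
    best = "/"
    for subtree in subtree_owner:
        if subtree != "/" and (path == subtree or path.startswith(subtree + "/")):
            if len(subtree) > len(best):
                best = subtree
    requests[best] += 1
    return subtree_owner[best]


def rebalance(all_mds):
    """Hand the busiest subtree of the busiest MDS to the least busy MDS."""
    load = {mds: 0 for mds in all_mds}
    for subtree, count in requests.items():
        load[subtree_owner[subtree]] += count
    busiest = max(all_mds, key=lambda mds: load[mds])
    idlest = min(all_mds, key=lambda mds: load[mds])
    candidates = [s for s, mds in subtree_owner.items() if mds == busiest and s != "/"]
    if busiest != idlest and candidates:
        hottest = max(candidates, key=lambda s: requests[s])
        subtree_owner[hottest] = idlest


for _ in range(10):
    owner_of("/home/alice/data.txt")
owner_of("/scratch/job1")
rebalance(["mds0", "mds1", "mds2"])  # mds2 was just added and has no load yet
```

Because a whole subtree moves at once, files that live near each other in the directory tree stay on the same MDS, which is what makes prefetching worthwhile.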

Reliability through Replication
Typical file systems, even distributed parallel ones, rely on data storage units that have RAID or SAN fail-over mechanisms to help maintain data access. This also includes redundant power supplies, possibly redundant RAID controllers, redundant network cards, and other costly hardware solutions. An example of this is Lustre. Ceph takes the opposite approach and uses replication to help maintain access to data. Ceph maintains copies of data across the OSD’s to ensure that the loss of one OSD, or even multiple OSD’s, will not cause the loss of data. If an OSD is lost, the objects that it contained also exist on other OSD’s and are immediately copied to the remaining OSD’s so that the proper number of copies is maintained. The copies are spread out so that no “hot spots” develop in the replication process and as much replication as possible takes place in parallel.
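
As a rough sketch of the recovery step, the code below drops a failed OSD and re-creates the missing copies on the survivors that hold the fewest objects. The replication factor, the starting layout, and the "fewest objects first" rule are assumptions for this example; in Ceph the targets are chosen by its own placement logic.

```python
REPLICAS = 2  # assumed replication factor for this sketch

# object ID -> set of OSDs holding a copy (hypothetical starting layout)
placement = {
    "obj-a": {"osd0", "osd1"},
    "obj-b": {"osd1", "osd2"},
    "obj-c": {"osd2", "osd3"},
}


def handle_osd_failure(failed, live_osds):
    """Drop the failed OSD and copy affected objects to the least-full survivors."""
    copies_made = []
    held = {osd: 0 for osd in live_osds}
    for holders in placement.values():
        holders.discard(failed)
        for osd in holders:
            held[osd] += 1
    for object_id in sorted(placement):
        holders = placement[object_id]
        while len(holders) < REPLICAS:
            # Pick the survivor with the fewest objects to avoid hot spots.
            target = min(
                (osd for osd in live_osds if osd not in holders),
                key=lambda osd: (held[osd], osd),
            )
            holders.add(target)
            held[target] += 1
            copies_made.append((object_id, target))
    return copies_made


copies = handle_osd_failure("osd1", live_osds=["osd0", "osd2", "osd3"])
```

Note that the two repair copies land on different OSD’s, so the work of rebuilding is shared rather than funneled through a single drive.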

Using replication does mean that you use more capacity to store the same data but it also means that you don’t need parity disks or “spare” disks making 100% use of all the storage in the OSD’s. It also means that you don’t develop hot spots in the OSD’s waiting for a RAID rebuild. Moreover, since you don’t need to do a RAID rebuild you don’t need the compute power, saving money and electrical power.
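
The capacity trade-off is simple arithmetic. The numbers below (12 drives of 2 TB, two copies, a group with two parity drives and one hot spare) are made-up values chosen only to show the calculation.

```python
def usable_with_replication(drives: int, drive_tb: float, replicas: int) -> float:
    """Every drive holds data, but each object is stored `replicas` times."""
    return drives * drive_tb / replicas


def usable_with_parity_raid(drives: int, drive_tb: float, parity: int, spares: int) -> float:
    """Parity and hot-spare drives hold no user data."""
    return (drives - parity - spares) * drive_tb


replicated_tb = usable_with_replication(drives=12, drive_tb=2.0, replicas=2)
raid_tb = usable_with_parity_raid(drives=12, drive_tb=2.0, parity=2, spares=1)
```

With these assumed numbers replication yields 12 TB of usable space against 18 TB for the parity layout, so the gain from replication is in simpler hardware and faster, parallel recovery rather than in raw capacity.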

Distributed Object Storage
One way to achieve better performance is to stripe data across multiple OSD’s (something like RAID-0). Ceph does this and uses replication to ensure that the loss of an OSD does not mean that the data is lost. The component of Ceph that does this is called RADOS (Reliable Autonomic Distributed Object Store). Figure 2 below presents how the data from a file is broken into objects and distributed to the OSD’s.

Figure 2: Ceph Distributed Object Storage.

A file is broken into objects and then these objects are mapped into placement groups (PG’s) using a simple hash function. Then the placement groups are assigned to OSD’s using a component of Ceph called CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a pseudo-random data distribution function that efficiently maps each PG to an ordered list of OSD’s where copies of the object are stored. One feature of CRUSH is that it is a globally known function, so any component of Ceph (client, MDS, OSD) can compute the location of an object. This means that you don’t have to involve the MDS to compute the location of an object.
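
The two-step mapping can be sketched as follows. This is not CRUSH: the second step uses a simple "rank the OSD’s by a hash" rule as a stand-in, and the PG count, replica count, and OSD names are assumptions. What the sketch does share with the description above is that the mapping is a pure function anyone can evaluate.

```python
import hashlib

PG_COUNT = 64  # assumed number of placement groups
REPLICAS = 3   # assumed number of copies
OSDS = [f"osd{i}" for i in range(8)]  # hypothetical cluster


def stable_hash(text: str) -> int:
    """Hash that gives the same answer on every node and every run."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def object_to_pg(object_id: str) -> int:
    """Step 1: a simple hash maps the object to a placement group."""
    return stable_hash(object_id) % PG_COUNT


def pg_to_osds(pg: int, osds=OSDS, replicas=REPLICAS):
    """Step 2 (stand-in for CRUSH): rank OSDs by a per-(PG, OSD) hash, keep the top few."""
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return ranked[:replicas]


def locate(object_id: str):
    return pg_to_osds(object_to_pg(object_id))


client_view = locate("1234.00000000")
osd_view = locate("1234.00000000")  # any other node computes the same list
```

Since no lookup table is consulted, a client that knows the object ID and the list of OSD’s can find the data without asking the MDS.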

Relaxation of POSIX (sort of)
Ceph uses the phrase “near-POSIX” because it has the ability to relax some of the POSIX semantics to improve performance (see the recent article POSIX IO Must Die!). In particular it uses a subset of a proposed set of extensions for POSIX for HPC (High-Performance Computing).

A classic example illustrating why extensions are needed for POSIX is a file that is opened by multiple clients (which usually happens in HPC) with either multiple writers or a mix of readers and writers. In that case the metadata server will revoke any read caching and write buffering capabilities to make sure that all clients access the data correctly. This forces the client IO to suddenly become synchronous and the performance drops tremendously, particularly for small files (POSIX is at least enforcing consistency, which is always good). However, some applications already know that they don’t have consistency issues because of the design of the application (this is common in HPC applications), yet they have to suffer a severe performance penalty because POSIX has chosen to trust no one, even if the application is correct because each writer or reader works on an independent part of the file.

The proposed POSIX extensions have options to address this issue as well as others. In particular, there is an option O_LAZY that is used for an open() syscall that explicitly relaxes coherency for a shared-write file. It assumes that the application is managing its own coherency. As previously mentioned, in HPC many applications can read/write to a single file from many processes since each process works on an independent part of the file. Using the O_LAZY option means that the applications can run at higher speeds, keeping the caching and buffering that would otherwise be revoked.
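
The access pattern that makes relaxed coherency safe is easy to show. In the sketch below each writer touches only its own byte range of a shared file. O_LAZY itself is not used, since it is only a proposed flag; the record size and the number of writers are assumptions, and the "ranks" run one after another here rather than as separate processes.

```python
import os
import tempfile

RECORD_SIZE = 8  # bytes owned by each writer
WRITERS = 4      # hypothetical number of processes (ranks)

fd, path = tempfile.mkstemp(prefix="shared-")
try:
    for rank in range(WRITERS):
        # Each rank touches only [rank * RECORD_SIZE, (rank + 1) * RECORD_SIZE).
        record = f"rank{rank:04d}".encode()
        os.pwrite(fd, record, rank * RECORD_SIZE)
    contents = os.pread(fd, RECORD_SIZE * WRITERS, 0)
finally:
    os.close(fd)
    os.remove(path)
```

Because no two writers ever overlap, there is nothing for the file system to arbitrate, which is exactly the knowledge an application would be asserting by asking for relaxed coherency.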


Ceph has a number of features which make it very attractive for the growing storage needs we all are experiencing. It is designed for scalability, reliability, and performance. At the same time it assumes that hardware will fail or that hardware will be added, so it has a design that can adapt to these situations. Ceph breaks the file system into two pieces: (1) metadata, and (2) data. This allows each piece to be designed in the most efficient manner to achieve these three goals of Ceph.

Ceph uses a dynamic distributed metadata server (MDS) that is not only clustered but also adapts to the changing workload. It will automatically distribute portions of the hierarchical directory tree to other MDS nodes in the cluster to better balance the load as the workload changes. In addition, if an MDS node is added, it will move portions of the metadata to that new box, again better distributing the load.

The concept of replication is used along with Object Storage Devices (OSD’s) so that all the space on all the drives is used (no parity drives, no spare drives). During the writing of an object to Ceph, it is automatically replicated to other OSD’s so that the loss of one or more OSD’s won’t result in the loss of data. If an OSD is lost, the objects are re-replicated so that the number of copies of the objects is maintained.

While the Ceph client was recently included in the 2.6.34 kernel (it was in an “rc” version where rc = release candidate), it is still considered not ready for prime time. It also uses btrfs as the underlying storage mechanism for the OSD’s, and btrfs itself is still in development. But including the client in the kernel does three things. First, it gives a vote of confidence to Ceph. Second, since it’s in the kernel it should get some more “development eyes” examining the code. And third, it should get more testing.

If you’re feeling “experimental” or have an upcoming need for larger amounts of storage, then give Ceph a try. It’s really not a scary octopus about to eat your boat.
