One Billion Dollars! Wait… I Mean One Billion Files!!!
The world is awash in data. This fact is putting more and more pressure on file systems to efficiently scale to handle increasingly large amounts of data. Recently, Ric Wheeler from Red Hat experimented with putting 1 billion files in a single file system to understand what problems the Linux community might face in the future. Let's see what happened...
Awash in a sea of data
No one is going to argue that the amount of data we generate and want to keep is growing at an unprecedented rate. In a 2008 article, blogger Dave Raffo highlighted some statistics from an IDC model of enterprise data growth: unstructured data was increasing at about a 61.7% CAGR (Compounded Annual Growth Rate). In addition, data in the cloud (Google, Facebook, etc.) was expected to increase at a rate of 91.8% through 2012. These are astonishing growth rates that are causing file system developers either to bite their fingernails to the quick or to start thinking about some fairly outlandish file system requirements.
As an example, a poster on LWN mentioned that a single MRI instrument can produce 20,000 files in a single scan. In about nine months, that one instrument had already produced about 23 million files.
Individuals are taking digital pictures with just about everything they own, with cell phone images being the most popular. These images get uploaded to desktops and laptops and, hopefully, end up on backup drives. Many of these images are also uploaded to Facebook or Flickr or even personal websites. I know of a friend's daughter who just started college and already has over 15,000 pictures, a majority of which are on Facebook. With a family of four, each taking 5,000-10,000 pictures a year, you can easily generate 20,000-40,000 files per year. Throw in email, games, papers, Christmas cards, music, and other sources of data, and a family can easily generate 1 million files a year on a family desktop or NAS server.
So far we've been able to store this much data because 2TB drives are very common, and 3TB drives are right around the corner. These can be used to create storage arrays that easily hit the 100TB mark with relatively little financial strain. Plus we can buy 2-10 of these drives during sales at Fry's and stuff them in a nice case, giving us anywhere from 4TB to 20TB of space just in our home desktop.
There are huge questions surrounding all of this data and its storage. How can we search this data? How can we ensure that the data doesn't become corrupted? (Sometimes that means making multiple copies, so our storage requirements just doubled.) How do we move data from our laptops, cell phones, and desktops to a more centrally managed location? How do we back up all of this data? But perhaps one of the more fundamental questions is: can our storage devices, specifically our file systems, store this much data and still be able to function?
Smaller Scale Testing
Recently, Ric Wheeler from Red Hat started doing some experiments with file systems to understand their limitations with regard to scale. In particular, he wanted to try loading up Linux file systems with 1 billion files and see what happened.
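You can get a feel for this kind of experiment on a much smaller scale yourself. The sketch below is not Ric Wheeler's actual test harness; it's a minimal, hypothetical Python script that creates a large number of zero-length files spread across subdirectories and reports the creation rate. Zero-length files are the cheapest way to stress a file system's metadata handling (inodes and directory entries) rather than its data path. The directory path and counts are arbitrary placeholders.

```python
import os
import time

def make_files(base_dir, n_dirs=10, files_per_dir=1000):
    """Create n_dirs subdirectories under base_dir, each holding
    files_per_dir zero-length files, and time the whole run."""
    start = time.time()
    total = 0
    for d in range(n_dirs):
        subdir = os.path.join(base_dir, "dir%04d" % d)
        os.makedirs(subdir, exist_ok=True)
        for f in range(files_per_dir):
            # Zero-length files: this exercises metadata
            # (inode/dentry) creation, not data blocks.
            open(os.path.join(subdir, "file%06d" % f), "w").close()
            total += 1
    elapsed = time.time() - start
    return total, elapsed

if __name__ == "__main__":
    # Example run: 10 directories x 1,000 files = 10,000 files.
    # Scale n_dirs/files_per_dir up (carefully!) to probe limits.
    total, elapsed = make_files("/tmp/fs_scale_test", 10, 1000)
    print("created %d files in %.2f seconds (%.0f files/sec)"
          % (total, elapsed, total / elapsed))
```

Watching how the files-per-second rate degrades as the counts grow, and how long an `ls -R` or `rm -rf` of the tree takes afterward, gives a small-scale preview of the problems a billion files would expose.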