Jeff Layton talks with Theodore Ts'o about getting the best performance out of your file system, painless migration and the work still to do.
While you can read the online documentation and articles about ext4, you can gain some important perspective by going directly to the horse’s mouth. Jeff Layton talks with Theodore Ts’o about designing ext4, painless migration and the work still to do.
Jeff Layton What original design goals did you have for ext4?
Theodore Ts’o There were a number of features that we’ve wanted to add to ext3 — to improve performance, support a larger number of blocks, etc. — that we couldn’t add without breaking backwards compatibility, and which would take a long enough time to stabilize that we couldn’t make those changes to the existing ext3 code base. Some of those features included: extents, delayed allocation, the multiblock allocator, persistent preallocation, metadata checksums, and online defragmentation.
Along the way we added some other new features that were easy to add, and didn’t require much extra work, such as NFSv4 compatible inode version numbers, nanosecond timestamps, and support for running without a journal — which was a feature that was important to Google, and which was contributed by a Google engineer. This wasn’t something we were planning on adding, but the patch was relatively straightforward, and it meant Google would be using ext4 and providing more testing for ext4, so it was a no-brainer to add that feature to ext4.
JL What design goals were not met in the current version of ext4? Why did they not make it into the current version?
TT The biggest thing which is not yet done on the kernel side is online defragmentation. Unfortunately, that work was contributed by developers who were relatively new to ext2/3/4 filesystem development, and the patches had a number of problems. They’ve been rewritten, and they are a lot better, but the patches are still not quite mainline ready. Hopefully we’ll be able to get those patches whipped into shape and merged in the near future, though.
The other piece which is still not quite finished is support for large block numbers. The kernel code is there, and it’s been lightly tested, but the e2fsprogs support for 64-bit block numbers has not yet been merged into mainline. Patches exist, but I haven’t had the time I’ve needed to do the necessary QA and to get those patches merged into e2fsprogs’ mainline. Again, that should hopefully happen soon.
JL Ext4 can be used as an upgrade path for ext3. Was this one of the top design goals and was there any consideration given to something completely new and not interoperable with ext3?
TT One of our primary design goals was that it should be painless to upgrade from ext3 to ext4. You might not get all of the benefits of ext4 unless you do a backup/reformat/restore of your filesystem, but you would get at least some of the benefits by simply remounting the filesystem using ext4 and enabling some of ext4’s features.
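For readers who want to try that in-place path, the conventional migration is a matter of enabling the new on-disk features with tune2fs and then mounting with the ext4 driver. A sketch, assuming a hypothetical device name (back up first):

```shell
# Hypothetical device name; back up the filesystem before converting.
# Enable the key ext4 on-disk features on an existing ext3 filesystem:
tune2fs -O extents,uninit_bg,dir_index /dev/sdXN

# tune2fs requires a full filesystem check after changing these features:
e2fsck -f /dev/sdXN

# Mount with the ext4 driver (and update /etc/fstab accordingly):
mount -t ext4 /dev/sdXN /mnt
</tt></pre>
```

Note that files written before the conversion keep their old indirect-block mapping; only newly written files use extents, which is why a backup/reformat/restore is needed to get the full benefit.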
We didn’t really consider doing something completely new and totally incompatible with ext3, because part of the goal of ext4 was to have something that could be stabilized fairly quickly. The reality is that it takes years for a completely new filesystem to be considered stable enough for use in an enterprise environment, and we wanted something that could be ready as quickly as possible.
Besides, there are other efforts, such as btrfs, which are starting from scratch, and btrfs will have new features, such as filesystem-level snapshots, and the equivalent of dynamic inode tables, that ext4 could never have because we wanted to stick to the tried-and-true ext3 design — with enhancements, to be sure, but in many critical ways ext4 doesn’t really deviate from ext3 in that we still use ext3’s physical block journaling, ext3’s fixed inode table, and bitmaps for inode and block allocation.
You can think of ext4 as being an exercise to see how much a tried-and-true filesystem design could be stretched while still retaining the fundamental BSD-style FFS architecture.
JL In some of the articles I have read on the web, there is some mention of being able to modify ext4 in the future for 64 bits to go above the 1 EB range. While not committing yourself to any comments à la Bill Gates and 640KB of memory, do you think it’s possible we’ll see a need for 64-bit block addressing? Would this become ext5?
TT Ext4’s kernel code supports 48-bit block numbers; using 4k block sizes, that gives a maximum filesystem size of 1 exabyte. One of the reasons why we decided to stick with this was out of consideration for the Clusterfs folks, who contributed the extents and delayed allocation code. Since they have customers using Lustre that utilize this format, we decided to keep on-disk compatibility so that Lustre users could easily migrate their server filesystems to use ext4.
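The 1 EB figure follows directly from the block-number width and the block size; a quick sanity check of the arithmetic (illustration only):

```shell
# 48-bit block numbers, 4 KiB (2^12-byte) blocks:
# 2^48 blocks * 2^12 bytes/block = 2^60 bytes = 1 EiB
max_bytes=$(( (1 << 48) * 4096 ))
echo $(( max_bytes == (1 << 60) ))        # 1 (true)
echo $(( max_bytes / (1 << 50) )) PiB     # 1024 PiB
```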
It would be relatively easy to add an alternate extent format to support 64-bit block numbers, and we may end up doing that at some point. The e2fsprogs code was written to easily support multiple extent formats; the kernel code is less flexible, but if this were to become an issue, we could add this support easily enough. I wouldn’t really consider this “ext5”; it would probably just be an additional feature for ext4.
JL Is there any thought being given to adapting ext4 to SSDs? If so, what concepts are being thrown around?
TT I’ve actually written a whole series of blog posts on this subject, which you can find on my blog.
Part of the problem right now is that SSDs are still undergoing major changes. For example, if you are using Intel’s new SSDs, the X25-M and X25-E, pretty much no changes seem to be necessary.
Ext4 has support for the ATA TRIM command, which allows filesystems to inform SSDs that blocks have been deleted and do not need to be taken into account by the SSD’s garbage-collection and wear-leveling algorithms. Unfortunately, the ATA TRIM command hasn’t been finalized yet, so (as of today) there are no drives, including Intel’s SSDs, that actually support it; for this reason, Linux’s block device layer does not currently issue the ATA TRIM command, since there haven’t been any devices on which to test it. At the moment, ext4 informs the block layer that blocks belonging to deleted files can be discarded, so once TRIM-capable SSDs become available and the Linux block layer actually sends the TRIM command to the drives, everything will be all set to go.
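From the administrator’s side, once TRIM-capable drives and block-layer support arrive, wiring this up is expected to be just a mount option. A sketch, assuming a hypothetical device name:

```shell
# Hypothetical device. The ext4 "discard" mount option asks the
# filesystem to pass deleted-block information down to the device:
mount -t ext4 -o discard /dev/sdXN /mnt

# Or persistently, via a line like this in /etc/fstab:
# /dev/sdXN  /mnt  ext4  defaults,discard  0  2
```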
However, even without TRIM support, the X25-M SSD works very well on ext4 today. I have one installed in my laptop, and it works just fine. Unfortunately, older SSDs do not work so well with ext2/3/4. It will be interesting to see how well the next generation of SSDs works on ext4. For example, I expect SanDisk and OCZ to both release new SSDs fairly soon. Neither of these SSD manufacturers has stated how their new SSDs will compare to Intel’s SSD offerings, but hopefully they will have comparable features. If so, it may not be worth trying to optimize ext4 for “legacy” SSDs. Time will tell….
JL What’s left to be done with ext4 and the supporting utilities?
TT The big thing that’s left to be done is the online resize and 48-bit block number support in e2fsprogs.