Jeff Bonwick's Blog

Monday Nov 02, 2009

ZFS Deduplication

You knew this day was coming: ZFS now has built-in deduplication.

If you already know what dedup is and why you want it, you can skip the next couple of sections. For everyone else, let's start with a little background.

What is it?

Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, 0.00000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
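
If you'd like to see what such a signature looks like, Solaris ships a digest(1) utility that will compute one from the shell; the file name below is just an example:

digest -a sha256 /etc/release

Two files with identical contents always produce identical output here; dedup simply applies the same idea to individual blocks as they are written.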

Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.

What to dedup: Files, blocks, or bytes?

Data can be deduplicated at the level of files, blocks, or bytes.

File-level assigns a hash signature to an entire file. File-level dedup has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings is lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.

Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well. Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume additional storage space. All other blocks are shared.

Byte-level dedup is in principle the most general, but it is also the most costly because the dedup code must compute 'anchor points' to determine where the regions of duplicated vs. unique data begin and end. Nevertheless, this approach is ideal for certain mail servers, in which an attachment may appear many times but not necessarily be block-aligned in each user's inbox. This type of deduplication is generally best left to the application (e.g. Exchange server), because the application understands the data it's managing and can easily eliminate duplicates internally rather than relying on the storage system to find them after the fact.

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

When to dedup: now or later?

In addition to the file/block/byte-level distinction described above, deduplication can be either synchronous (aka real-time or in-line) or asynchronous (aka batch or off-line). In synchronous dedup, duplicates are eliminated as they appear. In asynchronous dedup, duplicates are stored on disk and eliminated later (e.g. at night). Asynchronous dedup is typically employed on storage systems that have limited CPU power and/or limited multithreading to minimize the impact on daytime performance. Given sufficient computing power, synchronous dedup is preferable because it never wastes space and never does needless disk writes of already-existing data.

ZFS deduplication is synchronous. ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

How do I use it?

Ah, finally, the part you've really been waiting for.

If you have a storage pool named 'tank' and you want to use dedup, just type this:

zfs set dedup=on tank

That's it.

Like all zfs properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.
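
To see how the setting propagates through a pool, 'zfs get -r' should do the trick -- the SOURCE column tells you whether each dataset set the value locally or inherited it:

zfs get -r dedup tank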

What are the tradeoffs?

It all depends on your data.

If your data doesn't contain any duplicates, enabling dedup will add overhead (a more CPU-intensive checksum and on-disk dedup table entries) without providing any benefit. If your data does contain duplicates, enabling dedup will both save space and increase performance. The space savings are obvious; the performance improvement is due to the elimination of disk writes when storing duplicate data, plus the reduced memory footprint due to many applications sharing the same pages of memory.
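
If you're curious how much space dedup is actually buying you, the cumulative ratio is exposed as a read-only pool property (and, on builds that have dedup, as a DEDUP column in 'zpool list'):

zpool get dedupratio tank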

Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help. For example, suppose you have a storage pool containing home directories, virtual machine images, and source code repositories. You might choose to enable dedup as follows:

zfs set dedup=off tank/home

zfs set dedup=on tank/vm

zfs set dedup=on tank/src

Trust or verify?

If you accept the mathematical claim that a secure hash like SHA256 has only a 2^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

zfs set dedup=verify tank

Selecting a checksum

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify' option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.

Unfortunately, because there are so many variables that affect performance, I cannot offer any absolute guidance on which is better. However, if you are willing to make the investment to experiment with different checksum/verify options on your data, the payoff may be substantial. Otherwise, just stick with the default provided by setting dedup=on; it's cryptographically strong and it's still pretty fast.
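
If you do decide to experiment, one low-risk approach is to create a couple of scratch datasets with different dedup settings, copy the same sample data into each, and compare the wall-clock times. A rough sketch -- the dataset and path names below are made up, and it assumes the dedup property accepts a checksum name directly, as shown above for fletcher4,verify:

# one scratch dataset per dedup flavor
zfs create -o dedup=sha256 tank/ddtest-sha
zfs create -o dedup=fletcher4,verify tank/ddtest-f4

# copy the same sample data into each and compare elapsed times
ptime cp -r /export/sample /tank/ddtest-sha
ptime cp -r /export/sample /tank/ddtest-f4

# clean up when you're done
zfs destroy tank/ddtest-sha
zfs destroy tank/ddtest-f4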

Scalability and performance

Most dedup solutions only work on a limited amount of data -- a handful of terabytes -- because they require their dedup tables to be resident in memory.

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk. The topic of dedup performance could easily fill many blog entries -- and it will over time -- but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.
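
If you want to know how big your DDTs actually are, zdb can dump their statistics for a pool; the summary should include the number of entries and their approximate in-core and on-disk sizes, which gives you a rough idea of how much memory the tables will want:

zdb -DD tank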

Acknowledgements

Bill Moore and I developed the first dedup prototype in two very intense days in December 2008. Mark Maybee and Matt Ahrens helped us navigate the interactions of this mostly-SPA code change with the ARC and DMU. Our initial prototype was quite primitive: it didn't support gang blocks, ditto blocks, out-of-space, and various other real-world conditions. However, it confirmed that the basic approach we'd been planning for several years was sound: namely, to use the 256-bit block checksums in ZFS as hash signatures for dedup.

Over the next several months Bill and I tag-teamed the work so that at least one of us could make forward progress while the other dealt with some random interrupt of the day.

As we approached the end game, Matt Ahrens and Adam Leventhal developed several optimizations for the ZAP to minimize DDT space consumption both on disk and in memory, key factors in dedup performance. George Wilson stepped in to help with, well, just about everything, as he always does.

For final code review George and I flew to Colorado where many folks generously lent their time and expertise: Mark Maybee, Neil Perrin, Lori Alt, Eric Taylor, and Tim Haley.

Our test team, led by Robin Guo, pounded on the code and made a couple of great finds -- which were actually latent bugs exposed by some new, tighter ASSERTs in the dedup code.

My family (Cathy, Andrew, David, and Galen) demonstrated enormous patience as the project became all-consuming for the last few months. On more than one occasion one of the kids has asked whether we can do something and then immediately followed their own question with, "Let me guess: after dedup is done."

Well, kids, dedup is done. We're going to have some fun now.

Monday Dec 31, 2007

How to Lose a Customer

For over a year I have been the proud and happy owner of a Garmin GPS unit -- the Nuvi 360.  I have practically been a walking billboard for the company.  Go ahead, ask me about my Nuvi!

That changed today, permanently.  When I powered on the Nuvi this morning, it alerted me that its map database was over a year old and should be updated.  That makes sense, I thought -- indeed, how nice of them to remind me!  So I brought the Nuvi inside, plugged it into my Mac, and went to Garmin's website to begin the update.

Wait a minute, what's this?  They want to charge $69 for the update!  Excuse me?  This isn't new functionality I'm getting, it's a bug fix.  The product I bought is a mapping device.  Its maps are now "out of date", as Garmin puts it -- well, yes, in the same way that the phlogiston theory is "out of date".  The old maps are wrong, which means that the product has become defective and should be fixed.  Given the (somewhat pathetic) fact that the Nuvi doesn't automatically update its maps from Web or satellite sources, the least Garmin could do to keep their devices operating correctly in the field is provide regular, free fixes to the map database.  I didn't buy a GPS unit so I could forever navigate 2005 America.

But wait, it gets better.

You might imagine that getting the update would require supplying a credit card number to get a license key, downloading the map update, and then using the key to activate it.  Nope!  You have to order a physical DVD from Garmin, which takes 3-5 weeks to ship.  3-5 weeks!  Any reason they can't include a first-class postage stamp as part of the $69 shakedown?  And seriously, if you work for Garmin and you're reading this, check out this cool new technology.  It really works.  Swear to God.  You're soaking in it.

Assuming you ordered the DVD, you would not discover until after it arrived -- because this is mentioned nowhere on Garmin's website -- that the DVD will only work for one device.  Yes, that's right -- after going to all the trouble to get a physical copy of the map update, you have to get on their website to activate it, and it's only good for one unit.  So to update my wife's unit as well as my own, I'd have to order two DVDs, for $138.  That's offensive.  Even the RIAA doesn't expect me to buy two copies of every CD just because I'm married.  And the only reason I know about this is because I checked Amazon first, and found many reviewers had learned the hard way and were livid about it.  Garmin's policy is bad, but their failure to disclose it is even worse.

Moreover, the 2008 map update isn't a one-time purchase.  There's an update every year, so it's really a $138/year subscription.  That's $11.50/month.  For maps.  For a mapping device.  That I already paid for.

What does one get for this $11.50/month map subscription?  According to the reviews on Amazon, not much.  Major construction projects that were completed several years ago aren't reflected in the 2008 maps, and Garmin still hasn't fixed the long-standing bug that any store that's part of a mall isn't in their database.  (Want to find the nearest McDonald's?  No dice.  You just have to know that the nearest McDonald's is in the XYZ Shopping Center, and ask for directions to that.  This is really annoying in practice.)

I can get better information from Google maps, continuously updated, with integrated real-time traffic data, for free, forever -- and my iPhone will happily use that data to plot time-optimal routes.  (In fact, all the iPhone needs is the right antenna and a SIRF-3 chipset to make dedicated GPS devices instantly obsolete.  This is so obvious it can't be more than a year out.  I can live with the stale maps until then, and have a $138 down payment on the GPS iPhone earning interest while I wait.)

And so, starting today, that's exactly what I'll do.

I don't mind paying a reasonable fee for services rendered.  I do mind getting locked into a closed-source platform and being forced to pay monopoly rents for a proprietary, stale and limited version of data that's already available to the general public.  That business model is so over.

Everything about this stinks, Garmin.  You tell me, unexpectedly, that I have to pay for routine map updates.  You make the price outrageous.  You don't actually disclose what's in the update.  (Several Amazon reviewers say the new maps are actually worse.)  You make the update hard to do.  You needlessly add to our landfills by creating single-use DVDs.  You have an unreasonable licensing policy.  And you hide that policy until after the purchase.

Way to go, Garmin.  You have pissed off a formerly delighted customer, and that is generally a one-way ticket.  You have lost both my business and my respect.  I won't be coming back.  Ever.

Thursday Apr 12, 2007

A Near-Death Experience

Evidently, my previous post was just a tad too cheerful for some folks' taste.  But I speak with the optimism of a man who has cheated death.  And ironically, Pete's reference to George Cameron had a lot to do with it.

Several years ago, George and a few other Sun folks went off to form 3par, a new storage company.  They all had Solaris expertise, and understood its advantages, so they wanted to use it inside their box.  But we weren't open-source at the time, and our licensing terms really sucked.  Both of us -- George at 3par, and me at Sun -- tried for months to arrange something reasonable.  We failed.  So finally -- because Sun literally gave them no choice -- 3par went with Linux.

I couldn't believe it.  A cool new company wanted to use our product, and instead of giving them a hand, we gave them the finger.

For many of us, that was the tipping point.  If we had any reservations about open-sourcing Solaris, that ended them.  It was a gamble, to be sure, but the alternative was certain death.  Even if the 3par situation had ended differently, it was clear that we needed to change our business practices.  To do that, we'd first have to change our culture.

But cultures don't change easily -- it usually takes some traumatic event.  In Sun's case, watching our stock shed 95% of its value did the trick.  It was that total collapse of confidence -- that near-death experience -- that opened us up to things that had previously seemed too dangerous.  We had to face a number of hard questions, including the most fundamental ones: Can we make a viable business out of this wreckage?  Why are we doing SPARC?  Why not AMD and Intel?  Why Solaris?  Why not Linux and Windows?  Where are we going with Java?  And not rah-rah why, but really, why?

In each case, asking the question with a truly open mind changed the answer.  We killed our more-of-the-same SPARC roadmap and went multi-core, multi-thread, and low-power instead.  We started building AMD and Intel systems.  We launched a wave of innovation in Solaris (DTrace, ZFS, zones, FMA, SMF, FireEngine, CrossBow) and open-sourced all of it.  We started supporting Linux and Windows.  And most recently, we open-sourced Java.  In short, we changed just about everything.  Including, over time, the culture.

Still, there was no guarantee that open-sourcing Solaris would change anything.  It's that same nagging fear you have the first time you throw a party: what if nobody comes?  But in fact, it changed everything: the level of interest, the rate of adoption, the pace of communication.  Most significantly, it changed the way we do development.  It's not just the code that's open, but the entire development process.  And that, in turn, is attracting developers and ISVs whom we couldn't even have spoken to a few years ago.  The openness permits us to have the conversation; the technology makes the conversation interesting.

After coming so close to augering into the ground, it's immensely gratifying to see the Solaris revival now underway.  So if I sometimes sound a bit like the proud papa going on and on about his son, well, I hope you can forgive me.

Oh, and Pete, if you're reading this -- George Cameron is back at Sun now, three doors down the hall from me.  Small valley!

Tuesday Apr 10, 2007

Solaris Inside

When you choose an OS for your laptop, many things affect your decision: application support, availability of drivers, ease of use, and so on.

But if you were developing a storage appliance, what would you want from the operating system that runs inside it?

The first thing you notice is all the things you don't care about: graphics cards, educational software, photoshop... none of it matters. What's left, then?  What do you really need from a storage OS? And why isn't Linux the answer?  Well, let's think about that.

You need something rock-solid, so it doesn't break or corrupt data.

You need something that scales, so you can take advantage of all those cores the microprocessor folks will be giving you.

You need really good tools for performance analysis, so you can figure out how to make your application scale as well as the OS does.

You need extensive hardware diagnostic support, so that when parts of the box fail or are about to fail, you can take appropriate action.

You need reliable crash dumps and first-rate debugging tools so you can perform first-fault diagnosis when something goes wrong.

And you need a community of equally serious developers who can help you out.

OpenSolaris gives you all of these: a robust kernel that scales to thousands of threads and spindles; DTrace, the best performance analysis tool on the planet; FMA (Fault Management Architecture) to monitor the hardware and predict and manage failures; mdb to analyze software problems; and of course the OpenSolaris community, a large, vibrant, professional, high signal-to-noise environment.
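
To give just a taste of DTrace, here's the classic one-liner that counts system calls by process name across the entire box -- safely, on a live production system:

dtrace -n 'syscall:::entry { @[execname] = count(); }'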

The other operating systems one might consider are so far behind on so many of these metrics, it just seems like a no-brainer.

Let's put it this way: if I ever leave Sun to do a storage startup, I'll have a lot of things to think about.  Choosing the OS won't be one of them.  OpenSolaris is the ideal storage development platform.

The General-Purpose Storage Revolution

It happened so slowly, most people didn't notice until it was over.

I'm speaking, of course, of the rise of general-purpose computing during the 1990s.  It was not so long ago that you could choose from a truly bewildering variety of machines.  Symbolics, for example, made hardware specifically designed to run Lisp programs.  We debated SIMD vs. MIMD, dataflow vs. control flow, VLIW, and so on.  Meanwhile, those boring little PCs just kept getting faster.  And more capable.  And cheaper.  By the end of the decade, even the largest supercomputers were just clusters of PCs. A simple, general-purpose computing device crushed all manner of clever, sophisticated, highly specialized systems.

And the thing is, it had nothing to do with technology. It was all about volume economics.  It was inevitable.

With that in mind, I bring news that is very good for you, very good for Sun, and not so good for our competitors:  the same thing that happened to compute in the 1990s is happening to storage, right now. Now, as then, the fundamental driver is volume economics, and we see it playing out at all levels of the stack: the hardware, the operating system, and the interconnect.

First, custom RAID hardware can't keep up with general-purpose CPUs. A single Opteron core can XOR data at about 6 GB/sec.  There's just no reason to dedicate special silicon to this anymore.  It's expensive, it wastes power, and it was always a compromise: array-based RAID can't provide the same end-to-end data integrity that host-based RAID can. No matter how good the array is, a flaky cable or FC port can still flip bits in transit.  A host-based RAID solution like RAID-Z in ZFS can both detect and correct silent data corruption, no matter where it arises.
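
Setting up host-based RAID this way is a one-liner; the pool and device names below are just placeholders for whatever your system enumerates:

zpool create tank raidz c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0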

Second, custom kernels can't keep up with volume operating systems. I try to avoid naming specific competitors in this blog -- it seems tacky -- but think about what's inside your favorite storage box. Is it open source?  Does it have an open developer community? Does it scale?  Can the vendor make it scale?  Do they even get a vote?

The latter question is becoming much more important due to trends in CPU design.  The clock rate party of the 1990s, during which we went from 20MHz to 2GHz -- a factor of 100 -- is over.  Seven years into the new decade we're not even 2x faster in clock rate, and there's no sign of that changing soon.  What we are getting, however, is more transistors.  We're using them to put multiple cores on each chip and multiple threads on each core (so the chip can do something useful during load stalls) -- and this trend will only accelerate.

Which brings us back to the operating system inside your storage device. Does it have any prayer of making good use of a 16-core, 64-thread CPU?

Third, custom interconnects can't keep up with Ethernet.  In the time that Fibre Channel went from 1Gb to 4Gb -- a factor of 4 -- Ethernet went from 10Mb to 10Gb -- a factor of 1000.  That SAN is just slowing you down.

Today's world of array products running custom firmware on custom RAID controllers on a Fibre Channel SAN is in for massive disruption. It will be replaced by intelligent storage servers, built from commodity hardware, running an open operating system, speaking over the real network.

You've already seen the first instance of this: Thumper (the x4500) is a 4-CPU, 48-disk storage system with no hardware RAID controller. The storage is all managed by ZFS on Solaris, and exported directly to your real network over standard protocols like NFS and iSCSI.
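
Exporting that storage is just a matter of setting ZFS share properties -- for example (the dataset names here are made up), sharenfs for file service and, on the Solaris of this era, shareiscsi on a zvol for block service:

zfs set sharenfs=on tank/export

zfs create -V 100G tank/vol0
zfs set shareiscsi=on tank/vol0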

And if you think Thumper was disruptive, well... stay tuned.

Thursday Jan 11, 2007

Out of the mouths of babes...

After sizing up the computers we have at home, my son Andrew made the following declaration: "I want Solaris security, Mac interface, and Windows compatibility."  Age 10.  Naturally, sensing a teachable moment, I explained to him what virtualization is all about -- bootcamp, Parallels, Xen, etc.  And the thing is, he really gets it.  I can't wait to see what his generation is capable of.

Thursday Sep 16, 2004

Welcome

Welcome aboard! I'm Jeff Bonwick, a Distinguished Engineer (la-de-da!) at Sun. I'm guessing you're here because you recently read about ZFS.

Let me begin with a note of thanks.

According to Sun's website staff, the ZFS article has generated the highest reader response ever -- thank you! The ZFS team gets to see all the feedback you provide, so please keep it coming. I'll respond to some of the more interesting comments in this blog.

My favorite comment thus far was a caustic remark about 128-bit storage, which will be the subject of the next post...

