Jeff Bonwick's Blog

Monday Nov 02, 2009

ZFS Deduplication

You knew this day was coming: ZFS now has built-in deduplication.

If you already know what dedup is and why you want it, you can skip the next couple of sections. For everyone else, let's start with a little background.

What is it?

Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, 0.00000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
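
Strictly speaking, 2^-256 is the chance that two particular blocks collide; across a pool of n blocks the birthday bound governs. A rough worked example (the block count here is an assumption, chosen to match the figures Eric Schrock gives in the comments below):

p(any collision) ~ n^2 / (2 * 2^256) = n^2 / 2^257
with n = 2^98 blocks (about 2^115 bytes at 128K per block):
p ~ 2^196 / 2^257 = 2^-61, or roughly 4 x 10^-19, still on the order of 10^-18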

Chunks of data are remembered in a table of some sort that maps the data's checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.

What to dedup: Files, blocks, or bytes?

Data can be deduplicated at the level of files, blocks, or bytes.

File-level dedup assigns a hash signature to an entire file. It has the lowest overhead when the natural granularity of data duplication is whole files, but it also has significant limitations: any change to any block in the file requires recomputing the checksum of the whole file, which means that if even one block changes, any space savings are lost because the two versions of the file are no longer identical. This is fine when the expected workload is something like JPEG or MPEG files, but is completely ineffective when managing things like virtual machine images, which are mostly identical but differ in a few blocks.

Block-level dedup has somewhat higher overhead than file-level dedup when whole files are duplicated, but unlike file-level dedup, it handles block-level data such as virtual machine images extremely well. Most of a VM image is duplicated data -- namely, a copy of the guest operating system -- but some blocks are unique to each VM. With block-level dedup, only the blocks that are unique to each VM consume additional storage space. All other blocks are shared.

Byte-level dedup is in principle the most general, but it is also the most costly because the dedup code must compute 'anchor points' to determine where the regions of duplicated vs. unique data begin and end. Nevertheless, this approach is ideal for certain mail servers, in which an attachment may appear many times but not necessarily be block-aligned in each user's inbox. This type of deduplication is generally best left to the application (e.g. Exchange server), because the application understands the data it's managing and can easily eliminate duplicates internally rather than relying on the storage system to find them after the fact.

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).
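
As a point of reference, the checksum function is already a per-dataset property, so the stronger hash can be selected even before dedup enters the picture. A minimal sketch, using 'tank' (the example pool name from the next section):

zfs set checksum=sha256 tank
zfs get checksum tank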

When to dedup: now or later?

In addition to the file/block/byte-level distinction described above, deduplication can be either synchronous (aka real-time or in-line) or asynchronous (aka batch or off-line). In synchronous dedup, duplicates are eliminated as they appear. In asynchronous dedup, duplicates are stored on disk and eliminated later (e.g. at night). Asynchronous dedup is typically employed on storage systems that have limited CPU power and/or limited multithreading to minimize the impact on daytime performance. Given sufficient computing power, synchronous dedup is preferable because it never wastes space and never does needless disk writes of already-existing data.

ZFS deduplication is synchronous. ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

How do I use it?

Ah, finally, the part you've really been waiting for.

If you have a storage pool named 'tank' and you want to use dedup, just type this:

zfs set dedup=on tank

That's it.

Like all ZFS properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis.

What are the tradeoffs?

It all depends on your data.

If your data doesn't contain any duplicates, enabling dedup will add overhead (a more CPU-intensive checksum and on-disk dedup table entries) without providing any benefit. If your data does contain duplicates, enabling dedup will both save space and increase performance. The space savings are obvious; the performance improvement is due to the elimination of disk writes when storing duplicate data, plus the reduced memory footprint due to many applications sharing the same pages of memory.

Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help. For example, suppose you have a storage pool containing home directories, virtual machine images, and source code repositories. You might choose to enable dedup as follows:

zfs set dedup=off tank/home

zfs set dedup=on tank/vm

zfs set dedup=on tank/src
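
To confirm which datasets actually picked up the setting, and to see how much space dedup is saving, the ordinary property queries should do (a sketch, assuming the builds that include dedup also expose the read-only 'dedupratio' pool property):

zfs get -r dedup tank
zpool get dedupratio tank
zpool list tank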

Trust or verify?

If you accept the mathematical claim that a secure hash like SHA256 has only a 2^-256 probability of producing the same output given two different inputs, then it is reasonable to assume that when two blocks have the same checksum, they are in fact the same block. You can trust the hash. An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if they are not. To enable this variant of dedup, just specify 'verify' instead of 'on':

zfs set dedup=verify tank

Selecting a checksum

Given the ability to detect hash collisions as described above, it is possible to use much weaker (but faster) hash functions in combination with the 'verify' option to provide faster dedup. ZFS offers this option for the fletcher4 checksum, which is quite fast:

zfs set dedup=fletcher4,verify tank

The tradeoff is that unlike SHA256, fletcher4 is not a pseudo-random hash function, and therefore cannot be trusted not to collide. It is therefore only suitable for dedup when combined with the 'verify' option, which detects and resolves hash collisions. On systems with a very high data ingest rate of largely duplicate data, this may provide better overall performance than a secure hash without collision verification.

Unfortunately, because there are so many variables that affect performance, I cannot offer any absolute guidance on which is better. However, if you are willing to make the investment to experiment with different checksum/verify options on your data, the payoff may be substantial. Otherwise, just stick with the default provided by setting dedup=on; it's cryptographically strong and it's still pretty fast.
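
For completeness, the dedup property also accepts the checksum name spelled out ('on' is shorthand for the sha256 default), and a child dataset can always be reverted to whatever its parent specifies. A sketch using the same example pool:

zfs set dedup=sha256 tank
zfs set dedup=sha256,verify tank
zfs inherit dedup tank/home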

Scalability and performance

Most dedup solutions only work on a limited amount of data -- a handful of terabytes -- because they require their dedup tables to be resident in memory.

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk. The topic of dedup performance could easily fill many blog entries -- and it will over time -- but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.
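
The practical question, then, is whether your DDT fits in memory. A back-of-the-envelope sketch (in ksh or bash; the 64K average block size and the ~320 bytes per in-core DDT entry are assumptions drawn from later community sizing guidance, not figures from this post):

# Back-of-the-envelope DDT sizing -- every number here is an assumption
UNIQUE_BYTES=$((1024 * 1024 * 1024 * 1024))   # 1 TB of unique (post-dedup) data
AVG_BLOCK=$((64 * 1024))                      # assumed average block size of 64K
ENTRY_BYTES=320                               # rough in-core cost per DDT entry (approximation)
echo "unique blocks:  $((UNIQUE_BYTES / AVG_BLOCK))"
echo "approx DDT RAM: $((UNIQUE_BYTES / AVG_BLOCK * ENTRY_BYTES)) bytes"   # ~5 GB at these numbers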

Acknowledgements

Bill Moore and I developed the first dedup prototype in two very intense days in December 2008. Mark Maybee and Matt Ahrens helped us navigate the interactions of this mostly-SPA code change with the ARC and DMU. Our initial prototype was quite primitive: it didn't support gang blocks, ditto blocks, out-of-space, and various other real-world conditions. However, it confirmed that the basic approach we'd been planning for several years was sound: namely, to use the 256-bit block checksums in ZFS as hash signatures for dedup.

Over the next several months Bill and I tag-teamed the work so that at least one of us could make forward progress while the other dealt with some random interrupt of the day.

As we approached the end game, Matt Ahrens and Adam Leventhal developed several optimizations for the ZAP to minimize DDT space consumption both on disk and in memory, key factors in dedup performance. George Wilson stepped in to help with, well, just about everything, as he always does.

For final code review George and I flew to Colorado where many folks generously lent their time and expertise: Mark Maybee, Neil Perrin, Lori Alt, Eric Taylor, and Tim Haley.

Our test team, led by Robin Guo, pounded on the code and made a couple of great finds -- which were actually latent bugs exposed by some new, tighter ASSERTs in the dedup code.

My family (Cathy, Andrew, David, and Galen) demonstrated enormous patience as the project became all-consuming for the last few months. On more than one occasion one of the kids has asked whether we can do something and then immediately followed their own question with, "Let me guess: after dedup is done."

Well, kids, dedup is done. We're going to have some fun now.

Comments:

Can't wait to test it...

Posted by Thommy M. Malmström on November 02, 2009 at 06:36 AM PST #

Really great news. While other filesystems try to get to the point where ZFS was yesterday, ZFS moves ahead.

Posted by c0t0d0s0 on November 02, 2009 at 06:50 AM PST #

"When using a secure hash like SHA256, the probability of a hash collision is about 2^-256"

Jeff, what about this?
http://en.wikipedia.org/wiki/Birthday_problem

Posted by max on November 02, 2009 at 06:50 AM PST #

Will dedup speed up copying or moving files from one dataset to another? If yes, will it result in read-only activity on the disks when moving the file or, even better, in only increasing the reference count of the blocks?

Posted by Cristian Yxen on November 02, 2009 at 06:59 AM PST #

Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying around that you can use for your kids!

Posted by Gregg Wonderly on November 02, 2009 at 07:07 AM PST #

Birthday collisions, having only 365 values, fully utilize less than 15 bits. 256 bit offers substantially more values. Feel free to recalculate the birthday formulas with 2^256 instead of 365 and post the result if you like.

Posted by Bahamat on November 02, 2009 at 07:31 AM PST #

" the probability of a hash collision is about 2^-256 = 10^-77 "

Consider 2^(-256/2), to simplify.

Posted by bastien on November 02, 2009 at 07:47 AM PST #

Does Dedup also apply to data in the L2ARC? If so... wow o_0

Posted by Ross on November 02, 2009 at 07:48 AM PST #

You mentioned that probably everyone wants to have the DDTs in memory. So a little formula like zfs_size/blocksize * N ~ DDT size would be helpful as well...

Posted by 79.194.19.161 on November 02, 2009 at 07:51 AM PST #

Bahamat: did you read the link? It includes probabilities for 256 bit hashes. With a 256 bit hash, there's a 1-in-a-billion chance of a collision with 1.5 × 10^34 blocks. How many blocks might there be in a ZFS array?

Posted by Michael S. on November 02, 2009 at 08:35 AM PST #

Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full root one.

Posted by Dick Davies on November 02, 2009 at 08:40 AM PST #

@max, @bastien -

There is a discussion of the overall chance of hash collision when factoring in the total number of blocks in the ARC thread for a related, but orthogonal case:

http://arc.opensolaris.org/caselog/PSARC/2009/557/mail

Which refers to the following table:

http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table

To have a collision probability of 10^-18 (already more reliable than almost anything else in the system), this would require approximately 2^98 unique blocks (2^115 bytes @128k) to be written, well beyond the limits for any foreseeable storage platform.

Posted by Eric Schrock on November 02, 2009 at 08:57 AM PST #

The problem with using a hash function is that attackers control a lot of data on your filesystem on a modern OS. Consider the web cache on a machine where a user browses the web. This gives an attacker a platform to intentionally try to cause collisions that can make the filesystem malfunction. These sorts of attacks are still hard, but there has been a lot of progress in attacking several popular hash functions lately. A solution to prevent this is to use a keyed hash function and keep the key secret.

Posted by newsham on November 02, 2009 at 09:47 AM PST #

Do you have any advice or observations on how dedup interacts with ZFS compression?

Posted by Alan Burlison on November 02, 2009 at 09:52 AM PST #

If the implementation of the SHA256 ( or possibly SHA512 at some point ) algorithm is well threaded then one would be able to leverage those massively multi-core Niagara T2 servers. The SHA256 hash is based on six 32-bit functions whereas SHA512 is based on six 64-bit functions. The CMT Niagara T2 can easily process those 64-bit hash functions and the multi-core CMT trend is well established. So long as context switch times are very low one would think that IO with a SHA512 based de-dupe implementation would be possible and even realistic. That would solve the hash collision concern I would think.

Posted by Dennis Clarke on November 02, 2009 at 09:55 AM PST #

This is great news.
How long until we can actually use this feature in the development builds? Can’t wait to try it out!

Posted by Brian White on November 02, 2009 at 10:23 AM PST #

"Excellent news; for starters this should give us the space benefits of a sparse zone with the flexibility of a full root one."

Unless the deduplication also spills over into the memory management, it doesn't: Running two deduplicated full root zones still requires twice the RAM of running only one of them, while running two sparse zones only requires twice the read/write pages, with the read-only pages being shared.

So the questions are: How often does Solaris load a couple of identical pages to memory, when they're deduplicated on disk? Are there plans to get that to "once", if it isn't already?

Real-life scenario: We run 60 sparse zones vs. the 16 full zones that the system could manage before we're out of RAM (4GB installed).
And the difference only grows with more RAM.

Posted by Patrick Georgi on November 02, 2009 at 10:39 AM PST #

Great news!! wtg Jeff & Co!

Does the deduping apply to the ARC/L2ARC as well (ie, only pointers to duplicate blocks reside, rather than the whole block)?

And I assume it works well with compression?

Can't wait to play. :)

Posted by Don MacAskill on November 02, 2009 at 11:30 AM PST #

Any ideas when this will show up in releases of OpenSolaris or Solaris 10?

Posted by Daniel on November 02, 2009 at 11:35 AM PST #

One question I thought of while reading is whether different pools recognize the data in the other pool for deduplication. For example, if I have pools A and B (deduplication is enabled in each) and I have an identical block X in both pools, will it be deduplicated? There are pros and cons to deduplicating across pools and not deduplicating across pools. Which did you choose and will there be options in the future to do the opposite?

Posted by Mark Weber on November 02, 2009 at 11:39 AM PST #

@Don - The ARC work (so that we deduplicate in-core) is forthcoming. Mark Maybee is making good progress on that.

And yes, it works perfectly well with compression - just as you would imagine. :)

Posted by Bill Moore on November 02, 2009 at 11:44 AM PST #

Congrats Jeff,

Another great step for OpenSolaris and OpenStorage.

I'm looking forward to snv_128 on IPS. Is there any way you can hurry the binary images over to IPS?

Posted by Mike La Spina on November 02, 2009 at 11:48 AM PST #

My Apologies,

Congrats Jeff, Bill and all the great team members.

Posted by Mike La Spina on November 02, 2009 at 11:53 AM PST #

newsham: as far as the world's open cryptographic community knows, it is impossible for anyone to generate collisions in SHA-256 even if they are deliberately trying to do so.

Dennis: Hm, can't ZFS use the hardware implementation of SHA-256 in the T2? Anyway, "the hash collision concern" is already solved by SHA-256 -- see Eric Schrock's comment.

all: it seems like there are some funny interactions between dedupe and crypto: http://mail.opensolaris.org/pipermail/zfs-crypto-discuss/2009-November/002947.html . I'm glad to learn from this blog post that Nicolas Williams was wrong to say that dedupe will always require full block comparison.

Posted by Zooko on November 02, 2009 at 11:54 AM PST #

So, is this going to be included in the next update of solaris or are we going to have to download some kind of a kernel patch soon?

Thanks.

Posted by Deniz on November 02, 2009 at 11:55 AM PST #

This is something I have always wished for. Some people rant about differential copies (copy on write of files) or sparse files, but the problem with these schemes is that they tend not to survive the file being copied to a different disk, or at any rate they work only if the filesystem has knowledge of the original creation of the redundancy. That is not the case here, and so this is much more general. Imagine, one can now write a file of one petabyte of zeroes - without the application telling the filesystem (other than by writing said zeroes).

Fun question: would it be possible for a malicious user to try and write blocks that he suspects are the same as blocks from data he's not supposed to be able to read, and figure out if a deduplication occurred by timing the process? Would be an interesting attack. (Note: this has nothing to do with finding a collision with whatever hash function is used, though those attacks are interesting as well; as far as I am aware, so far ZFS only depended on the hash function's resistance to pre-image attacks, if of course ZFS was supposed to guarantee cryptographic integrity; but now the hash function must also resist collision attacks or fun things might occur...)

One drawback of the current support, however: on a huge array where the dedup table has to spill to disk, and which is used in bursts for, say, backup, browsing the table (and adding entries to it) has no locality whatsoever (otherwise there would be a problem with the hash function, if I'm not mistaken), so it will hit the disk in the same proportion as the fraction of the table that is not in memory (in other words, you cannot efficiently cache that table). An offline option would have been useful for that use case.

Pierre (still in bargaining stage Mac user)

Posted by Pierre Lebeaupin on November 02, 2009 at 11:56 AM PST #

Congrats to everyone involved!

I'm curious about the claim of increased performance across the board -- doesn't read performance suffer from the transformation of what was once a contiguous read into many short reads in different areas of the disk? Or perhaps you have a strategy to deal with that, or don't find this fragmentation to be a problem in practice?

Thanks.

Posted by Chris on November 02, 2009 at 11:59 AM PST #

Well done Jeff. You seem to be involved with all the best work ;)

Re:"One question I thought of while reading is do different pools recognize the data in the other pool for deduplication. "

I doubt that dedup has the ability to be done across pools. Reason a is that it's a ZFS flag to turn it on rather than a zpool one. Reason b is that you can't be sure that the second pool will not be removed or even exported. I prefer it this way.

Posted by dasmo on November 02, 2009 at 12:09 PM PST #

Congrats on this achievement, but I also have a question.

Birthday paradox arguments aside, cryptographic hashes have a long history of being broken. If this ever happens to SHA-256, the Mother of All Remote privilege escalations will immediately apply. E.g., a collision with a block of the Windows kernel would let *a web site* run privileged code on however many hundreds of VMs hang off the corrupted physical block (assuming web caches get written to virtual disk).

The verify flag provides an out here, but it is not the default, and most users will take that hint. Unfortunately, there is no way to know whether you should have used verify until it is too late; if SHA-256 is ever broken, then a non-verify pool is defective.

I realize you all are smart cookies, and have thought much more deeply about this than I have just reading this blog post. So what's the punchline? Are we that much more confident about SHA-256 than its predecessors? Is the performance hit to verify enough to make the system useless for important domains? Some combination of the two?

Posted by Keith Adams on November 02, 2009 at 12:41 PM PST #

@Keith Adams:

It's not _that_ bad. For your exploit to work, you'd need to have the faulty file around _first_.

Deduplication doesn't overwrite existing files with a duplicate, but avoids writing duplicates in the first place. (at least as far as I understand it)

So, the attack vector would require you to know beforehand of a new component that's normally run with elevated rights. Then push a file with the same hash to the system, and then wait for deduplication to kick in.
That's a whole lot harder than "hey, I can write a file with the same content as the kernel".

It's possible to scan the hash space: write all kinds of files, keeping a different hash of them in RAM than the one used on disk; if the two differ when you read the data back, you've got a collision with another file.
And then hope for it to be something truly secret ;-)

But given the size of the hash space, that's not a productive use of your time, and it would be too easy to figure out that there's something fishy going on (who's creating and deleting billions of files all the time?)

Posted by Patrick Georgi on November 02, 2009 at 12:49 PM PST #

@zooko: indeed, ZFS dedup does have the option of not verifying block contents when hashes match -- I spoke too quickly. A minor error, I think.

@{Zooko, various others}: The point is that if you don't want to trust the hash function, well, you don't have to.

@Zooko: Back to the ZFS crypto issue that Darren was asking for advice on: by MACing every block in addition to hashing it we don't depend on the hash function's collision resistance for security, though, clearly, for dedup you'll want to enable block contents verification if you do fear attackers that can create hash collisions. IMO, not depending on a hash function's collision resistance is a good thing.

Posted by Nico on November 02, 2009 at 01:18 PM PST #

Congratulations on rolling this out of the door. Quite an achievement.

A quick & simple suggestion wrt "Trust or verify?" and, specifically, using a fast fletcher hash with subsequent verification of positives by byte-comparison. There is a third option: always look up by fletcher hash (assuming blocks are indexed by both this AND sha-256), and use sha to verify positives.

This has a slight advantage over verification by byte-comparison, especially with multiple positives, and a huge win (fletcher over sha) in negative cases. The latter yields a nice property: the less duplicated the data, the lower dedup's overhead.

Posted by Andrey Kuzmin on November 02, 2009 at 01:34 PM PST #

Where are we going to see it first, Solaris, OpenSolaris or S7000?

Posted by Jacob on November 02, 2009 at 02:22 PM PST #

That's nice and all... but for my purposes, it would be far more of a "win" to allow for cross-zfs (but same pool) hardlinks. And/or mv's.

Not to mention some kind of zfs-aware rsync, for fast, efficient remote replication (or restorals, for that matter!) that does not require keeping a matching "full-filesystem snapshot" around.
Maybe with this dup-detection stuff, you will be closer to having that happen now?

Posted by Philip on November 02, 2009 at 02:45 PM PST #

Fabulous info! Can't wait to try this out on my VM.

Posted by AppGirl on November 02, 2009 at 03:10 PM PST #

What a lot of others asked: Where and when are we, the general public, going to be able to see it and test it ourselves? Your blog speaks as if it's readily available, yet I don't see it in Update 8 of Sol10, nor in recent snapshots of OpenSolaris. Am I missing something?

Thanks

Posted by Wrex on November 02, 2009 at 03:53 PM PST #

I stand corrected:
http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010683.html

So when can we expect it in Solaris 10? :-)

Posted by Wrex on November 02, 2009 at 04:10 PM PST #

The probability of a collision for an equiprobable hash function is ~ 2^(-n/2), where "n" is the size of the output hash in bits. The probability is _not_ 2^-n (that's the probability of getting a single output, not the probability of two input documents producing the same output, which is exactly a collision).

For more information about collisions for cryptographical functions please read about the Birthday attack. For example: http://en.wikipedia.org/wiki/Birthday_attack.

Posted by Felipe Alfaro Solana on November 02, 2009 at 04:16 PM PST #

Congrats, really nice feature!

Did you consider side-channel attacks? Let us assume that an attacker that is a user on a machine knows that somewhere on disk there is a block that contains just a username and a password, say "john:abc"

Now he could write his own combinations "john:aaa", "john:aab", .. ,"john:zzz" each to a unique block in a file on disk and notice that the timing of writing "john:abc" is different to all the others, because the block "john:abc" is deduped?

Posted by Klaus Borchers on November 02, 2009 at 04:40 PM PST #

@Wrex: You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month. The current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is published. We just pushed build 126 publicly last week.

Posted by Dan Price on November 02, 2009 at 04:57 PM PST #

More thoughts: How about using dedup processing, to then enable synthetic snapshot creation across systems?

Example: two separate solaris systems, both running zfs filesystems that have "mostly similar" content, and need to be kept near-synced in future.
Rather than having to completely blow away and rebuild one, to then have a shared full snapshot for incremental zsends... how about some kind of tool that would create one?
Reasons this would be worthwhile, could be large datasets, and bandwidth-constrained interconnects.

and/or: user error. They were previously kept in sync with zsync, but some admin accidentally deletes the "wrong" snapshot. or all snapshots. That admin is going to have some very very unhappy users, unless there's a nice neat way to regen the common snapshot without long downtimes for rebuild.

Posted by Philip on November 02, 2009 at 05:16 PM PST #

For all the people that will climb down the hole of hash function probability it would be interesting to contrast that with just bit rot on modern drives.

Posted by Laird on November 02, 2009 at 05:19 PM PST #

@Klaus:

Timing it may be unreliable, especially on a busy system, but someone could just dtrace it to figure it out. It might make sense to have a nodedup option applied at the file level so developers have a chance to address such concerns for sensitive files.

Posted by John on November 02, 2009 at 05:34 PM PST #

We have no idea on the suitability of our data for dedup. Is there any method available that can scan the data and report on the suitability of switching dedup on?

Posted by pvw on November 02, 2009 at 05:43 PM PST #

Super news!! :D

Posted by Dave on November 02, 2009 at 05:49 PM PST #

@Felipe Alfaro Solana: The "probability of a collision" is NOT about 2^(n/2). That is the approximate number of items one needs to hash with an n-bit hash function for the probability of a collision to be about 50%.

The critical issue, which a few people have touched upon, is that the probability of "a collision" depends on how much data is being hashed. If we have a zettabyte of data and a 128K block size we can end up needing to hash up to 2^53 (about 10^16) blocks. The probability of getting any collision on a 256-bit hash function with this many tries is about 1 in 2^151 (less than 1 in 10^45). This is dozens of orders of magnitude safer than the underlying disc drives.

Posted by Nicko on November 02, 2009 at 06:11 PM PST #

Re side channel attacks - seems to me they are plausible both by measuring system response time and by measuring available space on the volume(s) after writing the suspect block(s). Even on busy system, it will be possible to identify trends. (e.g., if we write block "X" 1000 times, it is always much faster than if we write block "Y" 1000 times; same for space usage.) These attacks could be mitigated by making the actual response time equal to the expected worst-case scenario, and by limiting "space available" responses to "available under quota" versus "absolute space available", but both approaches would create other side effects such as slower response time(s) and forcing the use of quotas.

It could also be used to identify media files (music, video, whatever) if there's a typical/canonical encoding that's likely to be used.

Might be useful for identifying executable/library/data files installed on a machine (perhaps at the version level to identify vulnerabilities) if the attacker can work with a known example; or for iteratively determining the contents of a sensitive file such as a *nix password file (discussed above by Klaus Borchers) or a file containing a database access password, passphrase, .htaccess file (or password file pointed to by .htaccess file), etc.

I am a fan of the deduplication idea, but I think it has significant security implications that may be tough to identify early.

Posted by Greg Broiles on November 02, 2009 at 06:16 PM PST #

Thanks... I have been waiting for this since ZFS was made. :)

Posted by vince on November 02, 2009 at 07:51 PM PST #

Jeff,

Congratulations to you and your team! You have reenergized my enthusiasm for Solaris and you have given my business a strong reason to go w/ Sun Solaris in lieu of the competition.

God bless you. Take time for your family.

Joe

Posted by Joseph Kotran on November 02, 2009 at 09:04 PM PST #

I have a large ZFS pool (on a Storage 7000 system) containing virtual machine images. I'm sure the dedup stuff will be very useful for this kind of data. I'm wondering, though, will there be some way of forcing existing data through the algorithm, or will only data that was written after dedup was turned on benefit?

Posted by TA on November 03, 2009 at 12:14 AM PST #

Congratulations for a job well done.
After reading all the comments so far, would you consider the following suggestion: allow the user to select the checksumming method (SHA-256, SHA-512, fletcher4, or some future method)?
Enabling the checksumming function extensions as dynamically loadable modules would allow in-place dedup improvements.

Posted by sophie on November 03, 2009 at 12:58 AM PST #

Excellent work. I have two quick questions:

1. What happens when the DDTs cannot fit in memory? How many extra I/Os will be needed in this case for each application I/O?

2. Given the common case that two blocks are very similar but not identical, how does ZFS handle this?

Posted by Thanos Makatos on November 03, 2009 at 01:18 AM PST #

will there be an asynchronous version in the future??
thanks for the info

Posted by abc on November 03, 2009 at 01:54 AM PST #

Scary is all I can say.

So what happens when you lose a block on the disk that happened to be the reference block for 12 others?

First thing I'd want to know before letting this option anywhere near my production filesystems is how much redundancy is there and how easy is it to configure?

Posted by sophed on November 03, 2009 at 01:57 AM PST #

Does this open a backup poisoning attack? If I write the same blocks in sequence in one big or several small files, with a sequence just long enough to fool the compression system of the backup, would this enable an attacker to exhaust backup space?
If the backup space is another deduped ZFS system, would this enable an attacker to exhaust the communications capability? If there is a quota on the accounts, this is not a real problem, but with unlimited accounts, it may be.

Of course, most of the time it is not an attacker, but an idiot running untested software, without having a look at it, for extended periods of time.

Can ZFS trigger a warning if a block happens to be referenced more than a settable number of times?

Posted by Knut Grunwald on November 03, 2009 at 02:47 AM PST #

Sophed,
There is a mechanism for that. If a block is referenced by 1000 other blocks, you have a severe problem if that block corrupts. Therefore you will be able to specify how many references are allowed to a block, saying something like "a block cannot have more than 10 references" or something similar.

But ZFS is very secure, and for the redundancy you use raidz2 or raidz3, of course. With raidz2 (raid-6) you will get lots of redundancy.

Jeff,
Good work! BTW, Jeff, did you know that Solaris is the best OS out there? :o) Truly.

Posted by Kebabbert on November 03, 2009 at 05:27 AM PST #

Dreaming about a BitTorrent client that uses dedup to find chunks on disk before trying to download them.

Posted by 213.208.249.134 on November 03, 2009 at 05:47 AM PST #

Couple of perhaps stupid questions (if so, apologies):

Does the p value for collisions hold true for blocks that only differ in a single bit?

and

Do you/have you empirically tested your implementation in some way to verify collision frequency...?

Posted by 131.111.45.20 on November 03, 2009 at 05:59 AM PST #

When will we see this in Solaris? (not OpenSolaris)

In Solaris 10 next or Solaris +++ (11?)?

--
David Strom

Posted by David Strom on November 03, 2009 at 06:20 AM PST #

Awesome work! Congratulations! Can't wait to try this out. By the way, does this mean that data deduplication software is going to be pushed aside?

Posted by Lalith Suresh on November 03, 2009 at 08:10 AM PST #

"You should see it in OpenSolaris dev builds (i.e. http://pkg.opensolaris.org/dev) in roughly a month. The current "build" (build 128) closes for code changes on 9 November, then it gets some QA, and then is published. We just pushed build 126 publicly last week."

That's great, but OpenSolaris is so GNU/Linux-bastardized and changing so fast that it's both illogical and impractical to use it in production.

Let's try the question another way:

will the ZFS deduplication feature make it into Solaris 10 u9?

Posted by UX-admin on November 03, 2009 at 09:44 AM PST #

So I have to ask... Why use an expensive hashing algorithm at all? Why not use a cheaper hash, but trust + verify before actually performing a deduplication? This would reduce processor load and eliminate the problem of collisions in the event two pieces of data hash the same but are in fact different.

On my system, MD5 costs 1/3 of what SHA256 costs in CPU time, and while it may be more likely to cause collisions, ZFS should always do a byte-to-byte comparison on any block before deduping one to make sure they are, in fact, identical.

Posted by Mike on November 03, 2009 at 11:20 AM PST #

@Mike: That's what dedup=fletcher4,verify is for...

Posted by Max on November 03, 2009 at 11:32 AM PST #

For some real world hash collision examples you may want to try using backuppc on your data. It includes deduplication, but runs at a file level.

I was rather surprised to find that in a small system (a few TB), I'm running into quite a few files that have the same hash but different contents.

For example, on one of my backup servers:

# Pool is 2698.23GB comprising 1232264 files and 4369 directories (as of 11/3 10:16),
# Pool hashing gives 2651 repeated files with longest chain 51,

So, I have 3TB of data, and one of the checksums has 51 collisions that were detected (51 different files that had the same hash but different contents).

At first I was surprised that there were any collisions, but then I remembered the birthday problem...

In any case, it seems like "verify" is the most conservative setting, and I'm surprised that the file-system that basically promises to never corrupt your data defaults to a setting in which this could happen.

Sean

Posted by Sean Reifschneider on November 03, 2009 at 11:33 AM PST #

By default, fletcher4 is used when creating new zfs file-systems. Many users like myself have already filled up a bunch of zfs file-systems using fletcher4. What happens if a user tries to enable dedup but doesn't use verify for an existing fletcher4 filesystem? Is there a warning/error?

Are there any plans for asynchronous deduplication at some point?

For the paranoid and performance conscious, I could see wanting to do dedup with verification in a batch job, run at least nightly.

Posted by Garen on November 03, 2009 at 12:06 PM PST #

@Sean Reifschneider
What hash function is being used by software you mentioned?

Posted by Bartek Pelc on November 03, 2009 at 12:08 PM PST #

@Greg: Come to think of it, the easiest exploit would probably be to use the "readahead" features of the system, where not single blocks (say 4K) are read from disk, but larger sectors (say 64K).

Reading sixteen 4k-blocks on an even sector boundary, the first block, if not in cache, will take at least several msecs(disk access time), while the 15 subsequent blocks will be available in usecs.

In order to find out if a block with a certain content exists anywhere on disk, from within, say, a VM, you just write a sector with 15 blocks of random garbage onto the disk, but one block somewhere in the middle, e.g. #9, contains the contents you want to check. After a few minutes to hours, when the data you have written has been evicted from cache in ram, you read back the first block, which takes a long time, and then the others, which follow almost immediately - except for #9, which was deduplicated, and must be fetched from somewhere else on disk.

In practice, I guess, with all the optimisations and layers in the system, it may be far more complicated, but securing the system against this kind of attack will be even more difficult.

Posted by Klaus Borchers on November 03, 2009 at 12:57 PM PST #

If this is turned on for an existing ZFS file system are pre-existing segments able to be deduplicated? If ZFS only supports synchronous dedupe does the data have to be pulled off and repopulated?

Posted by Jason Herring on November 03, 2009 at 03:33 PM PST #

Hi

As we rely heavily on ZFS for more than 2 billion files in different filesystems, my question is: does dedup work across filesystems?

And is there a way to get the non-dedup ratio compared to the dedup ratio on a ZFS filesystem?

Could we get the "not deduped" size of a filesystem with df, or do we receive the deduped size?

just some management questions

hope to test your good work soon!

Michael

Posted by Michael Widmann on November 03, 2009 at 10:53 PM PST #

First of all, thank you all for this feature, and I hope you and your families will get some quality time together before hacking on all the questions and requests in these comments :)

And I also thank the commenters for raising some interesting questions and ideas.

Concerning the matter of deduplicating data that's already on our disks, in our existing systems, I'm afraid we'd have to make do for a while with a trick I use to compress previously uncompressed files (say, local zones). We shut down the zone, move its files to a subdirectory, and use Sun cpio (keeping ZFS ACLs) to copy the files back to their expected location. Upon write, they are compressed (and nowadays they'd also be deduped). This can be tweaked to do per-file copies/removals to minimize the free-space pressure when remaking existing systems.

Needless to say, some supported utility which allows to (un-)compress and (un-)dedup existing files in-place (like setting the Compressed flag on NTFS objects) is very welcome and long-awaited :) The already de-facto working capability of using different compression algorithms within the same ZFS dataset is also a bonus versus NTFS, and I'd love to see that in said utility. (In example of our local zones, the fresh install of binary files can be done with gzip-9, then new files like data and logs are written with a faster lzjb.)

Another question arises: what if we have the same files (blocks) compressed with different algorithms? The on-disk blocks apparently contain different (compressed) data and have different checksums for the same original data, and a different number of on-disk blocks for the same original files.

These would probably not dedup ultimately to one block, but to at least as many as there are different compression algorithms?

And for compressed blocks inside different original files (including VM images) the block-alignment would make it even less probable that we have dedupable whole blocks? Even a one-byte offset would not let us save space from otherwise same original data?

In short - does it mean we would (probably) save more space by not compressing certain filesystems (i.e. VM image containers) but rather only deduping them?

PS: I was surprised to see no follow-up to this comment:

> Well, now you just need to dedup your time schedule, and you'll have a lot of spare time blocks laying around that you can use for your kids!

Apparently, this strikes the family time too. Instead of going with kids to a zoo 10 times, "Jeff" only goes once and tells the family that they should remember it as 10 different trips ;)

Posted by Jim Klimov on November 04, 2009 at 05:15 AM PST #

Can ZFS use sha256 for self-healing instead of deduplication?

Posted by QuAzI on November 04, 2009 at 05:41 AM PST #

QuAzl - zfs set checksum=sha256 dataset

see man zfs for more details
It's been there for ever.

Posted by Robert Milkowski on November 04, 2009 at 06:50 AM PST #

I know about checksums and how they work for RAID-Z. But if ZFS can use sha256 to search for duplicates, maybe it can use it for self-healing instead of deduplicating?

Posted by QuAzI on November 04, 2009 at 07:46 AM PST #

@QuAzI
Sha256 in ZFS IS used for self-healing if you set it as the checksum algorithm. It is used, instead of fletcher, to check the integrity of every block in datasets, not only in RAIDZ. It has been like that since the creation of ZFS, and now this same checksum field can be used for two purposes: integrity and deduplication.

Posted by Bartek Pelc on November 04, 2009 at 08:52 AM PST #

Thanks. I didn't find that in the overviews and documents, just examples of self-healing of mirrors. It's good.

Posted by QuAzI on November 04, 2009 at 09:21 AM PST #
