From ahrens@sac.sfbay.sun.com Tue Oct 13 11:25:31 2009 Received: from newsunmail1brm.central.sun.com (newsunmail1brm.Central.Sun.COM [129.147.62.245]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DIPVF7005135 for ; Tue, 13 Oct 2009 11:25:31 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by newsunmail1brm.central.sun.com (8.13.7+Sun/8.13.7/ENSMAIL,v2.2) with ESMTP id n9DIPTHV058639; Tue, 13 Oct 2009 12:25:30 -0600 (MDT) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG00707TUI0700@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 11:25:30 -0700 (PDT) Received: from localhost.sfbay.sun.com ([129.146.17.46]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG00MLPTUH44C0@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 11:25:29 -0700 (PDT) Received: from localhost.sfbay.sun.com (localhost [127.0.0.1] (may be forged)) by localhost.sfbay.sun.com (8.14.3+Sun/8.14.3) with ESMTP id n9DIPSTf028911; Tue, 13 Oct 2009 11:25:28 -0700 (PDT) Received: (from ahrens@localhost) by localhost.sfbay.sun.com (8.14.3+Sun/8.14.3/Submit) id n9DIPSkK028907; Tue, 13 Oct 2009 11:25:28 -0700 (PDT) Date: Tue, 13 Oct 2009 11:25:28 -0700 (PDT) From: Matthew Ahrens Subject: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] To: PSARC-ext@sun.com Cc: zfs-team@sun.com Message-id: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 Status: RO Content-Length: 5225 Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI This information is Copyright 2009 Sun Microsystems 1. Introduction 1.1. Project/Component Working Name: ZFS send dedup 1.2. Name of Document Author/Supplier: Author: Lori Alt 1.3 Date of This Document: 13 October, 2009 4. 
Technical Description
	This case requests micro/patch binding; new interfaces are Committed.
4. Technical Description

OVERVIEW:

"Dedup" is an overall term for technologies that eliminate duplicate
copies of data in storage or memory. This specific application of
dedup is for ZFS send streams, i.e., the output of the 'zfs send'
command. For some kinds of data, much of the content of a send stream
consists of blocks for which identical copies have already been sent
earlier in the stream. This technology replaces later copies of a
block with a reference to the earlier copy. This can significantly
reduce the size of a send stream, which reduces the time it takes to
transfer such a stream over a communication channel.

PROPOSED SOLUTION:

A new '-D' option to 'zfs send' is proposed. This option will cause
dedup processing to be performed on the data being written to a send
stream. Dedup processing is optional because it isn't always
appropriate (some kinds of data have very little duplication) and it
has significant costs: the checksumming required to detect duplicate
blocks is CPU-intensive and the data that must be maintained while the
stream is being processed can occupy a very large amount of memory.

Duplicate blocks are detected by calculating a cryptographically
strong checksum on each data block. Blocks that have the same checksum
are presumed to be identical. The checksum type used at this time is
SHA256. However, the stream format contains a field which identifies
the checksum type, permitting other checksums to be used in the
future.

RELATION TO OTHER ZFS DEDUP WORK

There are several other ongoing ZFS projects that are potentially
related to this one: on-disk dedup, in-core dedup, and ZFS encryption
(PSARC/2007/261). The relation between this project and the other
projects is that over-the-wire (OTW) dedup does not depend on those
projects, but will be able to take advantage of some aspects of the
other dedup work when it is integrated.
Dedup of send streams can be performed regardless of whether the other
variants of dedup are operational. The main way that OTW dedup can
take advantage of the other varieties of dedup support is that if a
dedup-capable checksum of the data has already been calculated, the
'zfs send' processing will not recalculate it. It will use the
already-computed checksum, thereby reducing the CPU usage of the
stream dedup processing.

The checksum of each block sent in a dedup'ed stream will be included
in the stream. This gives the receive side of the code the option to
work with the in-core and on-disk dedup support to avoid the
re-computation of the checksum when the data is stored in memory or
on-disk. At this time, that option is not being used (because in-core
and on-disk dedup are still in development), and it might not ever be
used. But the interface has been designed in such a way as to allow
that optimization in the future.

SEND STREAM FORMAT COMPATIBILITY IMPACT

Over-the-wire dedup support requires a change to the format of a send
stream. A new "write-by-reference" record is used to indicate a write
operation that references data sent earlier in the stream. This new
record type will only appear in dedup'ed streams. A feature flag
indicating the use of dedup will be set in the stream's "begin"
record. Older versions of 'zfs receive' will reject the stream as
unreadable because of the presence of that feature flag. However, if
dedup is not being done on the stream, older versions of the zfs
software will be able to read the stream (assuming that the objects
recorded in the stream are of a version that can be interpreted by the
version of zfs on the receiving system, but that is an existing
requirement, not one added by this project).
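
The sender-side and receiver-side behavior described above can be
illustrated with a minimal sketch. This is not the ZFS implementation
and does not reflect the actual stream record layout, which this case
does not spell out. The record classes (Write, WriteByRef), their
fields, and the functions dedup_send and receive are all hypothetical
names chosen for illustration. The only elements taken from the case
are the SHA256 checksum, the presumption that equal checksums mean
equal blocks, the checksum being carried in the stream, and a
"write-by-reference" record pointing at data sent earlier.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union


@dataclass
class Write:
    """Hypothetical ordinary write record carrying the block data."""
    offset: int
    checksum: bytes
    data: bytes


@dataclass
class WriteByRef:
    """Hypothetical write-by-reference record: no data, only a pointer
    to the offset of an earlier block with the same checksum."""
    offset: int
    checksum: bytes
    ref_offset: int


Record = Union[Write, WriteByRef]


def dedup_send(blocks: Iterable[Tuple[int, bytes]]) -> Iterator[Record]:
    """Emit records for (offset, data) blocks, replacing repeats
    with references to the first copy seen in this stream."""
    seen: Dict[bytes, int] = {}  # checksum -> offset of first copy
    for offset, data in blocks:
        digest = hashlib.sha256(data).digest()
        if digest in seen:
            # Same checksum: presumed identical, so send a reference.
            yield WriteByRef(offset, digest, seen[digest])
        else:
            seen[digest] = offset
            yield Write(offset, digest, data)


def receive(records: Iterable[Record]) -> Dict[int, bytes]:
    """Rebuild the blocks, resolving references to earlier writes."""
    out: Dict[int, bytes] = {}
    for rec in records:
        if isinstance(rec, Write):
            out[rec.offset] = rec.data
        else:
            out[rec.offset] = out[rec.ref_offset]
    return out
```

Note that the 'seen' table in this sketch grows with every novel
block, which is the memory cost the proposal mentions; the sketch
makes no attempt to bound it.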
CHANGES TO THE ZFS(1M) MANPAGE

65c62
< zfs send [-vR] [-[iI] snapshot] snapshot
---
> > zfs send [-DvR] [-[iI] snapshot] snapshot
1746c1677
< zfs send [-vR] [-[iI] snapshot] snapshot
---
> > zfs send [-DvR] [-[iI] snapshot] snapshot
1753a1685,1689
> > -D
> > Perform dedup processing on the stream. Dedup'ed streams
> > cannot be received on systems that do not support the stream
> > dedup feature.
> >

ATTRIBUTES

See attributes(5) for descriptions of the following attributes:

 ____________________________________________________________
|       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
|_____________________________|_____________________________|
| Availability                | SUNWzfsu                    |
|_____________________________|_____________________________|
| Interface Stability         | Committed                   |
|_____________________________|_____________________________|

6. Resources and Schedule
    6.4. Steering Committee requested information
	6.4.1. Consolidation C-team Name:
		ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

From Darren.Moffat@sun.com Tue Oct 13 11:40:04 2009 Received: from newsunmail1brm.central.sun.com (newsunmail1brm.Central.Sun.COM [129.147.62.245]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DIe4aE005313 for ; Tue, 13 Oct 2009 11:40:04 -0700 (PDT) Received: from nwk-avmta-2.sfbay.sun.com (nwk-avmta-2.SFBay.Sun.COM [129.145.155.6]) by newsunmail1brm.central.sun.com (8.13.7+Sun/8.13.7/ENSMAIL,v2.2) with ESMTP id n9DIe3Hg002567; Tue, 13 Oct 2009 12:40:03 -0600 (MDT) Received: from pmxchannel-daemon.nwk-avmta-2.sfbay.sun.com by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG00107UIQSR00@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 11:40:02 -0700 (PDT) Received: from gmp-eb-inf-2.sun.com ([192.18.6.24]) by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG00KL7UIP4950@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 11:40:02 -0700 (PDT) Received: from fe-emea-09.sun.com 
(gmp-eb-lb-1-fe1.eu.sun.com [192.18.6.7] (may be forged)) by gmp-eb-inf-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id n9DIe1og020502; Tue, 13 Oct 2009 18:40:01 +0000 (GMT) Received: from conversion-daemon.fe-emea-09.sun.com by fe-emea-09.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRG00400UBYNA00@fe-emea-09.sun.com>; Tue, 13 Oct 2009 19:39:42 +0100 (BST) Received: from [192.168.1.105] (cpc2-rdng20-2-0-cust917.15-3.cable.virginmedia.com [86.28.167.150]) by fe-emea-09.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRG00BE5UI6KI00@fe-emea-09.sun.com>; Tue, 13 Oct 2009 19:39:42 +0100 (BST) Date: Tue, 13 Oct 2009 19:39:42 +0100 From: Darren J Moffat Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> Sender: Darren.Moffat@sun.com To: Matthew Ahrens Cc: PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD4C96E.1090608@Sun.COM> MIME-version: 1.0 Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> User-Agent: Thunderbird 2.0.0.22 (X11/20090818) Status: RO Content-Length: 624 Matthew Ahrens wrote: > Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI > This information is Copyright 2009 Sun Microsystems > 1. Introduction > 1.1. Project/Component Working Name: > ZFS send dedup > 1.2. Name of Document Author/Supplier: > Author: Lori Alt > 1.3 Date of This Document: > 13 October, 2009 > 4. Technical Description > This case requests micro/patch binding; new interfaces are Comitted. +1. One tiny nit is that this case imports the SHA256 interfaces from libmd but given those are Committed I'm not going to be picky and say the materials need updating. 
-- Darren J Moffat From carlsonj@workingcode.com Tue Oct 13 12:07:07 2009 Received: from newsunmail1brm.central.sun.com (newsunmail1brm.Central.Sun.COM [129.147.62.245]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DJ76sV006165 for ; Tue, 13 Oct 2009 12:07:06 -0700 (PDT) Received: from brm-avmta-1.central.sun.com (brm-avmta-1.Central.Sun.COM [129.147.4.11]) by newsunmail1brm.central.sun.com (8.13.7+Sun/8.13.7/ENSMAIL,v2.2) with ESMTP id n9DJ73oa014483 for <@sunmail2sca.sfbay.sun.com:PSARC-ext@sun.com>; Tue, 13 Oct 2009 13:07:06 -0600 (MDT) Received: from pmxchannel-daemon.brm-avmta-1.central.sun.com by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG00321VRTRG00@brm-avmta-1.central.sun.com> for PSARC-ext@sun.com (ORCPT PSARC-ext@sun.com); Tue, 13 Oct 2009 13:07:05 -0600 (MDT) Received: from sca-ea-mail-3.sun.com ([192.18.43.21]) by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG00C0HVRSES90@brm-avmta-1.central.sun.com> for PSARC-ext@sun.com (ORCPT PSARC-ext@sun.com); Tue, 13 Oct 2009 13:07:04 -0600 (MDT) Received: from relay11i.sun.com (ip121.net129179-4.block1.us.syntegra.com [129.179.4.121]) by sca-ea-mail-3.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9DJ4C9f011755 for ; Tue, 13 Oct 2009 19:07:04 +0000 (GMT) Received: from mmp13es.mmp.us.syntegra.com ([160.41.208.13] [160.41.208.13]) by relay11i.sun.com with ESMTP id BT-MMP-4009470; Tue, 13 Oct 2009 19:07:03 +0000 (Z) Received: from relay15i.sun.com (relay15i.sun.com [129.179.4.125]) by mmp13es.mmp.us.syntegra.com with ESMTP id BT-MMP-684681; Tue, 13 Oct 2009 19:07:03 +0000 (Z) Received: from carlson.workingcode.com ([75.150.68.97] [75.150.68.97]) by relay1i.sun.com with ESMTP id BT-MMP-18300549; Tue, 13 Oct 2009 19:07:03 +0000 (Z) Received: from [10.50.24.188] (gate.abinitio.com [65.170.40.132]) (authenticated bits=0) by carlson.workingcode.com (8.14.2+Sun/8.14.3) with ESMTP id 
n9DJ6xAY005564 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 13 Oct 2009 15:07:00 -0400 (EDT) Date: Tue, 13 Oct 2009 15:06:59 -0400 From: James Carlson Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> To: Matthew Ahrens Cc: PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD4CFD3.1020703@workingcode.com> MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 X-Brightmail-Tracker: AAAAAA== X-DCC-dcc1.aftenposten.no-Metrics: carlson 1215; Body=3 Fuz1=3 Fuz2=3 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> User-Agent: Thunderbird 2.0.0.22 (X11/20090605) Status: RO Content-Length: 1464 Matthew Ahrens wrote: > A new '-D' option to 'zfs send' is proposed. This option will cause > dedup processing to be performed on the data being written to a send > stream. Dedup processing is optional because it isn't always appropriate > (some kinds of data have very little duplication) and it has significant > costs: the checksumming required to detect duplicate blocks is > CPU-intensive and the data that must be maintained while the stream > is being processed can occupy a very large amount of memory. "Must" seems a little strong. As it's just an optimization, throwing away old checksums if you have a large number of new ones to store -- and thus possibly sending some things uncompressed that could have been compressed if you'd had infinite memory -- seems like a plausible trade-off to avoid using "very large" amounts of memory. Moreover, if you find that you're seeing a lot of novel checksums (and thus using up a lot of memory), then that also implies that you're not getting much compression bang for the buck, and you might want to disable compression on the fly. 
(Many stream processing compressors do something like this; disabling the compressor at least temporarily if the compression ratio drops below some set limit.) How often does the user know for certain whether the undocumented data stream actually has a lot or a little duplicated data blocks? -- James Carlson 42.703N 71.076W From Nicolas.Williams@sun.com Tue Oct 13 12:40:44 2009 Received: from sunmail4.singapore.sun.com (sunmail4.Singapore.Sun.COM [129.158.71.19]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DJeglU006847 for ; Tue, 13 Oct 2009 12:40:43 -0700 (PDT) Received: from nwk-avmta-2.sfbay.sun.com (nwk-avmta-2.SFBay.Sun.COM [129.145.155.6]) by sunmail4.singapore.sun.com (8.13.4+Sun/8.13.3/ENSMAIL,v2.2) with ESMTP id n9DJeeld026722; Wed, 14 Oct 2009 03:40:42 +0800 (SGT) Received: from pmxchannel-daemon.nwk-avmta-2.sfbay.sun.com by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG00503XBRR300@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 12:40:39 -0700 (PDT) Received: from binky.Central.Sun.COM ([129.153.128.104]) by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG00K7FXBQ4K90@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 12:40:38 -0700 (PDT) Received: from binky.Central.Sun.COM (localhost [127.0.0.1]) by binky.Central.Sun.COM (8.14.3+Sun/8.14.3) with ESMTP id n9DJajTS008903; Tue, 13 Oct 2009 14:36:45 -0500 (CDT) Received: (from nw141292@localhost) by binky.Central.Sun.COM (8.14.3+Sun/8.14.3/Submit) id n9DJagYG008902; Tue, 13 Oct 2009 14:36:42 -0500 (CDT) Date: Tue, 13 Oct 2009 14:36:42 -0500 From: Nicolas Williams Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD4CFD3.1020703@workingcode.com> To: James Carlson Cc: Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <20091013193642.GK887@Sun.COM> MIME-version: 1.0 Content-type: text/plain; charset=us-ascii Content-transfer-encoding: 7BIT 
Content-disposition: inline X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> X-Authentication-warning: binky.Central.Sun.COM: nw141292 set sender to Nicolas.Williams@sun.com using -f User-Agent: Mutt/1.5.7i Status: RO Content-Length: 1945 On Tue, Oct 13, 2009 at 03:06:59PM -0400, James Carlson wrote: > Matthew Ahrens wrote: > > A new '-D' option to 'zfs send' is proposed. This option will cause > > dedup processing to be performed on the data being written to a send > > stream. Dedup processing is optional because it isn't always appropriate > > (some kinds of data have very little duplication) and it has significant > > costs: the checksumming required to detect duplicate blocks is > > CPU-intensive and the data that must be maintained while the stream > > is being processed can occupy a very large amount of memory. > > "Must" seems a little strong. As it's just an optimization, throwing > away old checksums if you have a large number of new ones to store -- > and thus possibly sending some things uncompressed that could have been > compressed if you'd had infinite memory -- seems like a plausible > trade-off to avoid using "very large" amounts of memory. Moreover, if > you find that you're seeing a lot of novel checksums (and thus using up > a lot of memory), then that also implies that you're not getting much > compression bang for the buck, and you might want to disable compression > on the fly. (Many stream processing compressors do something like this; > disabling the compressor at least temporarily if the compression ratio > drops below some set limit.) Throwing away of cached blocks probably needs to be done synchronously by both ends, or else the receiver has to at least keep an index of block checksum to block pointer for all previously seen blocks in the stream. Synchronizing the caches may require additional records in the stream. 
But I agree with you: it should be possible to bound the memory usage of zfs send dedup. Also, in ZFS today block checksums are used for integrity protection, not for block equality comparisons. The fact that here blocks would not be compared for actual equality does worry me somewhat. Nico -- From Lori.Alt@sun.com Tue Oct 13 13:01:12 2009 Received: from sunmail5.uk.sun.com (sunmail5.UK.Sun.COM [129.156.85.165]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DK1AbQ008222 for ; Tue, 13 Oct 2009 13:01:11 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by sunmail5.uk.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.2) with ESMTP id n9DK0nl9020054; Tue, 13 Oct 2009 21:01:09 +0100 (BST) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG00J51Y9V6P00@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 13:01:07 -0700 (PDT) Received: from brmea-mail-4.sun.com ([192.18.98.36]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG008B7Y9V1270@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 13:01:07 -0700 (PDT) Received: from fe-amer-10.sun.com ([192.18.109.80]) by brmea-mail-4.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9DK16X3003110; Tue, 13 Oct 2009 20:01:06 +0000 (GMT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRG00300XZVMY00@mail-amer.sun.com>; Tue, 13 Oct 2009 14:01:06 -0600 (MDT) Received: from [172.20.24.226] ([unknown] [172.20.24.226]) by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRG000B3Y9N26E0@mail-amer.sun.com>; Tue, 13 Oct 2009 14:00:59 -0600 (MDT) Date: Tue, 13 Oct 2009 14:00:21 -0600 From: Lori Alt Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: 
<20091013193642.GK887@Sun.COM> Sender: Lori.Alt@sun.com To: Nicolas Williams Cc: James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Reply-to: Lori.Alt@sun.com Message-id: <4AD4DC55.7030300@Sun.COM> MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_9bTO6t6/ureppH5DulAvGg)" X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> User-Agent: Thunderbird 2.0.0.21 (X11/20090622) Status: RO Content-Length: 5880 This is a multi-part message in MIME format. --Boundary_(ID_9bTO6t6/ureppH5DulAvGg) Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT On 10/13/09 13:36, Nicolas Williams wrote: > On Tue, Oct 13, 2009 at 03:06:59PM -0400, James Carlson wrote: > >> Matthew Ahrens wrote: >> >>> A new '-D' option to 'zfs send' is proposed. This option will cause >>> dedup processing to be performed on the data being written to a send >>> stream. Dedup processing is optional because it isn't always appropriate >>> (some kinds of data have very little duplication) and it has significant >>> costs: the checksumming required to detect duplicate blocks is >>> CPU-intensive and the data that must be maintained while the stream >>> is being processed can occupy a very large amount of memory. >>> >> "Must" seems a little strong. As it's just an optimization, throwing >> away old checksums if you have a large number of new ones to store -- >> and thus possibly sending some things uncompressed that could have been >> compressed if you'd had infinite memory -- seems like a plausible >> trade-off to avoid using "very large" amounts of memory. Moreover, if >> you find that you're seeing a lot of novel checksums (and thus using up >> a lot of memory), then that also implies that you're not getting much >> compression bang for the buck, and you might want to disable compression >> on the fly. 
(Many stream processing compressors do something like this;
>> disabling the compressor at least temporarily if the compression ratio
>> drops below some set limit.)
>>
>
> Throwing away of cached blocks probably needs to be done synchronously
> by both ends, or else the receiver has to at least keep an index of
> block checksum to block pointer for all previously seen blocks in the
> stream.  Synchronizing the caches may require additional records in the
> stream.  But I agree with you: it should be possible to bound the memory
> usage of zfs send dedup.
>
Yes, the memory usage can be bounded.   It was our plan at this time
however to regard that as an implementation detail, not part of the
interface to be approved by this case.

> Also, in ZFS today block checksums are used for integrity protection,
> not for block equality comparisons.  The fact that here blocks would not
> be compared for actual equality does worry me somewhat
>
The plan is to use a SHA256 checksum, or something comparably strong, so
that the probability of collision becomes too small to worry about.
Perhaps Darren Moffat can weigh in on why this kind of checksum is
adequate, because I'm pretty much taking his word for it.

Lori

--Boundary_(ID_9bTO6t6/ureppH5DulAvGg)-- From carlsonj@workingcode.com Tue Oct 13 13:01:35 2009 Received: from sunmail3mpk.sfbay.sun.com (sunmail3mpk.SFBay.Sun.COM [129.146.11.52]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DK1Y9s008258 for ; Tue, 13 Oct 2009 13:01:34 -0700 (PDT) Received: from nwk-avmta-2.sfbay.sun.com (nwk-avmta-2.SFBay.Sun.COM [129.145.155.6]) by sunmail3mpk.sfbay.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.4) with ESMTP id n9DK1Wog021312; Tue, 13 Oct 2009 13:01:33 -0700 (PDT) Received: from pmxchannel-daemon.nwk-avmta-2.sfbay.sun.com by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRG0070NYAK6S00@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 13:01:32 -0700 (PDT) Received: from sca-ea-mail-2.sun.com ([192.18.43.25]) by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRG00K7HYAJ4L90@nwk-avmta-2.sfbay.sun.com>; Tue, 13 Oct 2009 13:01:31 -0700 (PDT) Received: from relay13i.sun.com (ip123.net129179-4.block1.us.syntegra.com [129.179.4.123]) by sca-ea-mail-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id n9DK1UXQ012450; Tue, 13 Oct 2009 20:01:31 +0000 (GMT) Received: from mmp12es.mmp.us.syntegra.com ([160.41.208.12] [160.41.208.12]) by relay13i.sun.com with ESMTP id BT-MMP-4011276; Tue, 13 Oct 2009 19:59:30 +0000 (Z) Received: from relay14i.sun.com (relay14i.sun.com [129.179.4.124]) by mmp12es.mmp.us.syntegra.com with ESMTP id BT-MMP-770646; Tue, 13 Oct 2009 19:59:30 +0000 (Z) Received: from carlson.workingcode.com ([75.150.68.97] [75.150.68.97]) by relay1i.sun.com with ESMTP id BT-MMP-269348; Tue, 13 Oct 2009 19:59:30 +0000 (Z) Received: from [10.50.24.188] (gate.abinitio.com [65.170.40.132]) (authenticated bits=0) by carlson.workingcode.com (8.14.2+Sun/8.14.3) with ESMTP id n9DJxRIt013729 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 13 Oct 2009 15:59:27 -0400 (EDT) Date: Tue, 13 Oct 2009 15:59:27 -0400 From: James 
Carlson Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <20091013193642.GK887@Sun.COM> To: Nicolas Williams Cc: Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD4DC1F.1070400@workingcode.com> MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 X-Brightmail-Tracker: AAAAAA== X-DCC-dcc1.aftenposten.no-Metrics: carlson 1215; Body=4 Fuz1=4 Fuz2=4 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> User-Agent: Thunderbird 2.0.0.22 (X11/20090605) Status: RO Content-Length: 1307 Nicolas Williams wrote: > Throwing away of cached blocks probably needs to be done synchronously > by both ends, or else the receiver has to at least keep an index of > block checksum to block pointer for all previously seen blocks in the > stream. Synchronizing the caches may require additional records in the > stream. But I agree with you: it should be possible to bound the memory > usage of zfs send dedup. Good point. Actually, if you go far enough, you'll reinvent LZW. :-/ > Also, in ZFS today block checksums are used for integrity protection, > not for block equality comparisons. The fact that here blocks would not > be compared for actual equality does worry me somewhat. I'd briefly considered worrying about hitting an unfortunate birthday in SHA256, but then decided that since I probably wouldn't use this option (preferring instead to pass 'zfs send' output through a separate compressor), I didn't care much and also didn't know enough about the math involved to get excited about it. In other words, I'll assume that someone's looked into the integrity issues and verified that the probability of an accidental mismatch is similar to the risk of corruption in un-or-under-protected parts of memory. 
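
For a rough sense of scale on the "unfortunate birthday" concern, the
standard birthday bound can be evaluated directly. This is only a
sketch under stated assumptions: it treats SHA256 output as uniformly
random, considers accidental (not adversarial) collisions, and the
block count used in the usage note is an arbitrary illustration, not a
figure from this case. The function name is hypothetical.

```python
import math


def log2_collision_bound(n_blocks: int, hash_bits: int = 256) -> float:
    """Return log2 of the birthday upper bound on the probability that
    any two of n_blocks distinct blocks share a hash value:
        p <= n(n-1)/2 / 2**hash_bits
    Working in log2 avoids underflow for such tiny probabilities."""
    if n_blocks < 2:
        raise ValueError("need at least two blocks for a collision")
    pairs = n_blocks * (n_blocks - 1) // 2
    return math.log2(pairs) - hash_bits
```

As an illustration, 2**35 distinct blocks (4 PiB of data at a 128 KiB
block size) gives a bound of roughly 2**-187 for a 256-bit hash.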
-- James Carlson 42.703N 71.076W From carlsonj@workingcode.com Tue Oct 13 14:19:29 2009 Received: from sunmail4.singapore.sun.com (sunmail4.Singapore.Sun.COM [129.158.71.19]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DLJSn1010430 for ; Tue, 13 Oct 2009 14:19:29 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by sunmail4.singapore.sun.com (8.13.4+Sun/8.13.3/ENSMAIL,v2.2) with ESMTP id n9DLJOMS019606; Wed, 14 Oct 2009 05:19:25 +0800 (SGT) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRH006051WC0H00@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 14:19:24 -0700 (PDT) Received: from brmea-mail-4.sun.com ([192.18.98.36]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRH008V21WB16D0@nwk-avmta-1.sfbay.Sun.COM>; Tue, 13 Oct 2009 14:19:24 -0700 (PDT) Received: from relay41i.sun.com ([192.5.209.70]) by brmea-mail-4.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9DLDQQE005849; Tue, 13 Oct 2009 21:19:23 +0000 (GMT) Received: from mms48es.mms.us.syntegra.com ([160.41.221.230] [160.41.221.230]) by relay41i.sun.com with ESMTP id BT-MMP-838194; Tue, 13 Oct 2009 21:19:23 +0000 (Z) Received: from relay44i.sun.com (relay44i.sun.com [192.5.209.118]) by mms48es.mms.us.syntegra.com with ESMTP id BT-MMP-691109; Tue, 13 Oct 2009 21:19:22 +0000 (Z) Received: from carlson.workingcode.com ([75.150.68.97] [75.150.68.97]) by relay4i.sun.com with ESMTP id BT-MMP-6582665; Tue, 13 Oct 2009 21:19:22 +0000 (Z) Received: from [10.50.24.188] (gate.abinitio.com [65.170.40.132]) (authenticated bits=0) by carlson.workingcode.com (8.14.2+Sun/8.14.3) with ESMTP id n9DLJMtZ025366 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 13 Oct 2009 17:19:22 -0400 (EDT) Date: Tue, 13 Oct 2009 17:19:21 -0400 From: James Carlson Subject: Re: ZFS send dedup 
[PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD4DC55.7030300@Sun.COM> To: Lori.Alt@sun.com Cc: Nicolas Williams , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD4EED9.4030700@workingcode.com> MIME-version: 1.0 Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 X-Brightmail-Tracker: AAAAAA== X-DCC-dmv.com-Metrics: carlson 1181; Body=5 Fuz1=5 Fuz2=5 X-Antispam: No, score=-0.2/5.0, scanned in 0.131sec at (localhost [127.0.0.1]) by smf-spamd v1.3.1 - http://smfs.sf.net/ References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD4DC55.7030300@Sun.COM> User-Agent: Thunderbird 2.0.0.22 (X11/20090605) Status: RO Content-Length: 1564 Lori Alt wrote: > On 10/13/09 13:36, Nicolas Williams wrote: >> Throwing away of cached blocks probably needs to be done synchronously >> by both ends, or else the receiver has to at least keep an index of >> block checksum to block pointer for all previously seen blocks in the >> stream. Synchronizing the caches may require additional records in the >> stream. But I agree with you: it should be possible to bound the memory >> usage of zfs send dedup. >> > Yes, the memory usage can be bounded. It was our plan at this time > however to regard that as an implementation detail, not part of the > interface to be approved by this case. It becomes part of the interface if (a) the sender needs to notify the recipient of table flushes (as Nico reasonably suggested) or potentially (b) it becomes part of the usage considerations for users. There's actually a good bit of prior art to draw on here from other stream compression schemes. >> Also, in ZFS today block checksums are used for integrity protection, >> not for block equality comparisons. 
The fact that here blocks would not >> be compared for actual equality does worry me somewhat >> > The plan is to use a SHA256 checksum, or something comparably strong, so > that the probability of collision becomes too small to worry about. > Perhaps Darren Moffat can weigh in on why this kind of checksum is > adequate, because I'm pretty much taking his word for it. That's the sort of review I was hoping for. ;-} -- James Carlson 42.703N 71.076W From Lori.Alt@sun.com Tue Oct 13 15:01:53 2009 Received: from sunmail3mpk.sfbay.sun.com (sunmail3mpk.SFBay.Sun.COM [129.146.11.52]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9DM1ri7011643 for ; Tue, 13 Oct 2009 15:01:53 -0700 (PDT) Received: from brm-avmta-1.central.sun.com (brm-avmta-1.Central.Sun.COM [129.147.4.11]) by sunmail3mpk.sfbay.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.4) with ESMTP id n9DM1qYi010055; Tue, 13 Oct 2009 15:01:53 -0700 (PDT) Received: from pmxchannel-daemon.brm-avmta-1.central.sun.com by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRH00L0J3V4SL00@brm-avmta-1.central.sun.com>; Tue, 13 Oct 2009 16:01:52 -0600 (MDT) Received: from brmea-mail-4.sun.com ([192.18.98.36]) by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRH00BPL3UXJJ90@brm-avmta-1.central.sun.com>; Tue, 13 Oct 2009 16:01:45 -0600 (MDT) Received: from fe-amer-10.sun.com ([192.18.109.80]) by brmea-mail-4.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9DM1jBd024941; Tue, 13 Oct 2009 22:01:45 +0000 (GMT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRH001003JM1J00@mail-amer.sun.com>; Tue, 13 Oct 2009 16:01:45 -0600 (MDT) Received: from [172.20.24.226] ([unknown] [172.20.24.226]) by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id 
<0KRH00KN53UQBE50@mail-amer.sun.com>; Tue, 13 Oct 2009 16:01:39 -0600 (MDT) Date: Tue, 13 Oct 2009 16:01:01 -0600 From: Lori Alt Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD4EED9.4030700@workingcode.com> Sender: Lori.Alt@sun.com To: James Carlson Cc: Nicolas Williams , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Reply-to: Lori.Alt@sun.com Message-id: <4AD4F89D.90005@Sun.COM> MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_EGpngqpNJVUTH1OnIBnYqA)" X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD4DC55.7030300@Sun.COM> <4AD4EED9.4030700@workingcode.com> User-Agent: Thunderbird 2.0.0.21 (X11/20090622) Status: RO Content-Length: 6973 This is a multi-part message in MIME format. --Boundary_(ID_EGpngqpNJVUTH1OnIBnYqA) Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT On 10/13/09 15:19, James Carlson wrote: > Lori Alt wrote: > >> On 10/13/09 13:36, Nicolas Williams wrote: >> >>> Throwing away of cached blocks probably needs to be done synchronously >>> by both ends, or else the receiver has to at least keep an index of >>> block checksum to block pointer for all previously seen blocks in the >>> stream. Synchronizing the caches may require additional records in the >>> stream. But I agree with you: it should be possible to bound the memory >>> usage of zfs send dedup. >>> >>> >> Yes, the memory usage can be bounded. It was our plan at this time >> however to regard that as an implementation detail, not part of the >> interface to be approved by this case. >> > > It becomes part of the interface if (a) the sender needs to notify the > recipient of table flushes (as Nico reasonably suggested) or potentially > (b) it becomes part of the usage considerations for users. 
There's > actually a good bit of prior art to draw on here from other stream > compression schemes. > > I missed Nico's suggestion about notification of the recipient for cache flushes. Actually, there is no need for a cache on the receive side. Or more exactly, the dataset hierarchy constructed by the receive IS the cache. The new write-by-reference record in the send stream essentially sends this information: * identification of where the data can be found already on the target system (i.e. the object set, the object, and the offset and length within the object) * the location where the data is to be written (object set, object, and offset). During the receive, all datasets being received are "held" and not deletable until the receive completes, so the data is guaranteed to be present. There is no need to maintain an index of block checksum to block pointer on the receive side. There IS a need to maintain this on the send side, which is where memory management is an issue. As for the send-side memory management, I agree that we could establish a public interface by which a caller can constrain the memory to be used. However, we were thinking that if such an interface turns out to be necessary, we could define it and add it later, once we gain more experience with how over-the-wire dedup gets used in practice. I don't know whether the kinds of on-the-fly compression disabling that James mentions are relevant for dedup'ing. For example, in one of my test cases, which is a hierarchy of datasets that contain Solaris development workspaces, you can go for a long time without finding more than a handful of duplicate blocks, but once you've finished with one development workspace and started on the next one, then you start getting lots of duplicates because now you're seeing identical copies of the files you processed in the first dataset. 
This is just one kind of data, but in general, it's hard to predict at what point in the stream you're going to start getting dedup'ing bang for your memory-hogging buck. Lori
--Boundary_(ID_EGpngqpNJVUTH1OnIBnYqA)-- From Lori.Alt@sun.com Thu Oct 15 11:18:54 2009 Received: from sunmail5.uk.sun.com (sunmail5.UK.Sun.COM [129.156.85.165]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9FIIrGL028240 for ; Thu, 15 Oct 2009 11:18:54 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by sunmail5.uk.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.2) with ESMTP id n9FIIeH3000027; Thu, 15 Oct 2009 19:18:53 +0100 (BST) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRK00J0TIVGW900@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 11:18:52 -0700 (PDT) Received: from brmea-mail-1.sun.com ([192.18.98.31]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRK00DMZIVEQG80@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 11:18:51 -0700 (PDT) Received: from fe-amer-09.sun.com ([192.18.109.79]) by brmea-mail-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9FIIoct022690; Thu, 15 Oct 2009 18:18:50 +0000 (GMT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRK00100HNQXN00@mail-amer.sun.com>; Thu, 15 Oct 2009 12:18:50 -0600 (MDT) Received: from [172.20.25.227] ([unknown] [172.20.25.227]) by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRK00863IUN6K10@mail-amer.sun.com>; Thu, 15 Oct 2009 12:18:23 -0600 (MDT) Date: Thu, 15 Oct 2009 12:17:44 -0600 From: Lori Alt Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <20091013193642.GK887@Sun.COM> Sender: Lori.Alt@sun.com To: Nicolas Williams Cc: James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Reply-to: Lori.Alt@sun.com Message-id: <4AD76748.4070306@Sun.COM> MIME-version: 1.0 Content-type: 
multipart/alternative; boundary="Boundary_(ID_Eio28eaaJC8PRNrlhyFxew)" X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> User-Agent: Thunderbird 2.0.0.19 (X11/20090218) Status: RO Content-Length: 1681 This is a multi-part message in MIME format. --Boundary_(ID_Eio28eaaJC8PRNrlhyFxew) Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT > Also, in ZFS today block checksums are used for integrity protection, > not for block equality comparisons. The fact that here blocks would not > be compared for actual equality does worry me somewhat. > > I presented the question: "Are SHA256 checksums good enough to establish block equality?" to Jeff Bonwick. His answer: > Yes. Collision probability is 10^-77, i.e. 77 nines. Nothing else > in a computer is even close to that reliable. lori
--Boundary_(ID_Eio28eaaJC8PRNrlhyFxew)-- From bhargava.yenduri@sun.com Thu Oct 15 12:01:21 2009 Received: from sunmail5.uk.sun.com (sunmail5.UK.Sun.COM [129.156.85.165]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9FJ1KxN029013 for ; Thu, 15 Oct 2009 12:01:20 -0700 (PDT) Received: from brm-avmta-1.central.sun.com (brm-avmta-1.Central.Sun.COM [129.147.4.11]) by sunmail5.uk.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.2) with ESMTP id n9FJ0lJF027765; Thu, 15 Oct 2009 20:01:18 +0100 (BST) Received: from pmxchannel-daemon.brm-avmta-1.central.sun.com by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRK0054JKU2LS00@brm-avmta-1.central.sun.com>; Thu, 15 Oct 2009 13:01:14 -0600 (MDT) Received: from jurassic-x4600.sfbay.sun.com ([129.146.17.59]) by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRK00ACOKU0WFE0@brm-avmta-1.central.sun.com>; Thu, 15 Oct 2009 13:01:13 -0600 (MDT) Received: from [129.146.108.66] (bluesky.SFBay.Sun.COM [129.146.108.66]) by jurassic-x4600.sfbay.sun.com (8.14.3+Sun/8.14.3) with ESMTP id n9FJ1BTG435970; Thu, 15 Oct 2009 12:01:12 -0700 (PDT) Date: Thu, 15 Oct 2009 11:58:27 -0700 From: Krishna Yenduri Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD76748.4070306@Sun.COM> To: Lori.Alt@sun.com Cc: Nicolas Williams , James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD770D3.2060001@sun.com> MIME-version: 1.0 Content-type: multipart/alternative; boundary="Boundary_(ID_Lwx2/y2dc4Ds6h6g8T0GNA)" X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD76748.4070306@Sun.COM> User-Agent: Thunderbird 2.0.0.22 (X11/20090909) Status: RO Content-Length: 3205 This is a multi-part message in MIME format. 
--Boundary_(ID_Lwx2/y2dc4Ds6h6g8T0GNA) Content-type: text/plain; charset=ISO-8859-1; format=flowed Content-transfer-encoding: 7BIT On 10/15/09 11:17 AM, Lori Alt wrote: > >> Also, in ZFS today block checksums are used for integrity protection, >> not for block equality comparisons. The fact that here blocks would not >> be compared for actual equality does worry me somewhat. >> >> > > I presented the question: "Are SHA256 checksums good enough to > establish block equality?" to Jeff Bonwick. His answer: > >> Yes. Collision probability is 10^-77, i.e. 77 nines. Nothing else >> in a computer is even close to that reliable. Note that the probability of a collision also depends on the number of blocks in the stream. For example, one would need to do 2^128 SHA256 digests to get a probability of a collision > 0.5. There is a nice table at http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table that gives the upper bound on the number of blocks to achieve a given probability. I would agree that this is a reliable way to establish block equality given the number of blocks needed for even a probability of 10^-18. Regards, -Krishna
--Boundary_(ID_Lwx2/y2dc4Ds6h6g8T0GNA)-- From Scott.Rotondo@sun.com Thu Oct 15 17:35:02 2009 Received: from sunmail5.uk.sun.com (sunmail5.UK.Sun.COM [129.156.85.165]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9G0Z24f009660 for ; Thu, 15 Oct 2009 17:35:02 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by sunmail5.uk.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.2) with ESMTP id n9G0YvLr021183; Fri, 16 Oct 2009 01:35:01 +0100 (BST) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRL0000D0ABFO00@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 17:34:59 -0700 (PDT) Received: from brmea-mail-1.sun.com ([192.18.98.31]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRL00KZO0AAKX50@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 17:34:59 -0700 (PDT) Received: from fe-amer-09.sun.com ([192.18.109.79]) by brmea-mail-1.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9G0Ywbg014017; Fri, 16 Oct 2009 00:34:58 +0000 (GMT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRL0080001VWZ00@mail-amer.sun.com>; Thu, 15 Oct 2009 18:34:58 -0600 (MDT) Received: from [129.146.108.62] ([unknown] [129.146.108.62]) by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRL00CW80A9VM60@mail-amer.sun.com>; Thu, 15 Oct 2009 18:34:58 -0600 (MDT) Date: Thu, 15 Oct 2009 17:34:57 -0700 From: Scott Rotondo Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD770D3.2060001@sun.com> Sender: Scott.Rotondo@sun.com To: Krishna Yenduri Cc: Lori.Alt@sun.com, Nicolas Williams , James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD7BFB1.4040008@sun.com> MIME-version: 1.0 
Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD76748.4070306@Sun.COM> <4AD770D3.2060001@sun.com> User-Agent: Thunderbird 2.0.0.21 (X11/20090323) Status: RO Content-Length: 1741 Krishna Yenduri wrote: > On 10/15/09 11:17 AM, Lori Alt wrote: >> I presented the question: "Are SHA256 checksums good enough to >> establish block equality?" to Jeff Bonwick. His answer: >> >>> Yes. Collision probability is 10^-77, i.e. 77 nines. Nothing else >>> in a computer is even close to that reliable. > > Note that the probability of a collision also depends on the number of > blocks > in the stream. For example, one would need to do 2^128 SHA256 digests to > get a probability of a collision > 0.5. > > There is a nice table at > http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table > that gives the upper bound on the number of blocks to achieve > a given probability. > > I would agree that this is a reliable way to establish block equality > given the number of blocks needed for even a probability of 10^-18. > Perhaps it's worth pointing out that both statements above are correct, but they are answers to different questions. 10^-77 is the probability of a hash collision for a particular pair of blocks. For ZFS, we care if there is a collision between *any* pair of unequal blocks. That probability depends on the number of blocks, as Krishna points out. Finally, both of these calculations rely upon the implicit assumption that the 2^256 possible hash values are uniformly distributed; that assumption is widely accepted to be at least approximately true, but I'm not aware of a mathematical proof. In any case, I think it's safe to conclude that SHA-256 is more than adequate for filesystem block equality comparisons.
Scott -- Scott Rotondo Principal Engineer, Solaris Security Technologies President, Trusted Computing Group Phone/FAX: +1 408 850 3655 (Internal x68278) From gdamore@Sun.COM Thu Oct 15 19:27:13 2009 Received: from sunmail4.singapore.sun.com (sunmail4.Singapore.Sun.COM [129.158.71.19]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9G2RB7N011014 for ; Thu, 15 Oct 2009 19:27:12 -0700 (PDT) Received: from nwk-avmta-1.SFBay.Sun.COM (nwk-avmta-1.SFBay.Sun.COM [129.146.11.74]) by sunmail4.singapore.sun.com (8.13.4+Sun/8.13.3/ENSMAIL,v2.2) with ESMTP id n9G2R8ST026761; Fri, 16 Oct 2009 10:27:10 +0800 (SGT) Received: from pmxchannel-daemon.nwk-avmta-1.sfbay.Sun.COM by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRL00G015H9PA00@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 19:27:09 -0700 (PDT) Received: from sca-es-mail-2.sun.com ([192.18.43.133]) by nwk-avmta-1.sfbay.Sun.COM (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRL007L35H9T5A0@nwk-avmta-1.sfbay.Sun.COM>; Thu, 15 Oct 2009 19:27:09 -0700 (PDT) Received: from fe-sfbay-09.sun.com ([192.18.43.129]) by sca-es-mail-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id n9G2R9eN022423; Thu, 15 Oct 2009 19:27:09 -0700 (PDT) Received: from conversion-daemon.fe-sfbay-09.sun.com by fe-sfbay-09.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRL006005CT2B00@fe-sfbay-09.sun.com>; Thu, 15 Oct 2009 19:27:09 -0700 (PDT) Received: from [192.168.251.11] ([unknown] [76.93.15.33]) by fe-sfbay-09.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRL00IS65H79J40@fe-sfbay-09.sun.com>; Thu, 15 Oct 2009 19:27:08 -0700 (PDT) Date: Thu, 15 Oct 2009 19:27:07 -0700 From: "Garrett D'Amore" Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD7BFB1.4040008@sun.com> Sender: Garrett.Damore@Sun.COM To: Scott Rotondo Cc: Krishna Yenduri , 
Lori.Alt@Sun.COM, Nicolas Williams , James Carlson , Matthew Ahrens , PSARC-ext@Sun.COM, zfs-team@Sun.COM Message-id: <4AD7D9FB.1040100@sun.com> MIME-version: 1.0 Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD76748.4070306@Sun.COM> <4AD770D3.2060001@sun.com> <4AD7BFB1.4040008@sun.com> User-Agent: Thunderbird 2.0.0.22 (X11/20090909) Status: RO Content-Length: 2072 Scott Rotondo wrote: > Krishna Yenduri wrote: >> On 10/15/09 11:17 AM, Lori Alt wrote: >>> I presented the question: "Are SHA256 checksums good enough to >>> establish block equality?" to Jeff Bonwick. His answer: >>> >>>> Yes. Collision probability is 10^-77, i.e. 77 nines. Nothing else >>>> in a computer is even close to that reliable. >> >> Note that the probability of a collision also depends on the number >> of blocks >> in the stream. For example, one would need to do 2^128 SHA256 >> digests to >> get a probability of a collision > 0.5. >> >> There is a nice table at >> http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table >> that gives the upper bound on the number of blocks to achieve >> a given probability. >> >> I would agree that this is a reliable way to establish block equality >> given the number of blocks needed for even a probability of 10^-18. >> > > Perhaps it's worth pointing out that both statements above are > correct, but they are answers to different questions. 10^-77 is the > probability of a hash collision for a particular pair of blocks. For > ZFS, we care if there is a collision between *any* pair of unequal > blocks. That probability depends on the number of blocks, as Krishna > points out.
Finally, both of these calculations rely upon the implicit > assumption that the 2^256 possible hash values are uniformly > distributed; that assumption is widely accepted to be at least > approximately true, but I'm not aware of a mathematical proof. > > In any case, I think it's safe to conclude that SHA-256 is more than > adequate for filesystem block equality comparisons. That's true today. At what point will Moore's law catch up though? (In other words, how long will it take for storage densities to reach the point where the risk of a collision becomes significant?) Start from a petabyte (probably about the largest practical filesystem size in use today), and double every 12 months. (I think storage has been outpacing Moore somewhat.) - Garrett > > Scott > From Scott.Rotondo@sun.com Thu Oct 15 22:01:22 2009 Received: from sunmail4.singapore.sun.com (sunmail4.Singapore.Sun.COM [129.158.71.19]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9G51LpQ013128 for ; Thu, 15 Oct 2009 22:01:22 -0700 (PDT) Received: from brm-avmta-1.central.sun.com (brm-avmta-1.Central.Sun.COM [129.147.4.11]) by sunmail4.singapore.sun.com (8.13.4+Sun/8.13.3/ENSMAIL,v2.2) with ESMTP id n9G51Je9021508; Fri, 16 Oct 2009 13:01:21 +0800 (SGT) Received: from pmxchannel-daemon.brm-avmta-1.central.sun.com by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRL0011LCM79S00@brm-avmta-1.central.sun.com>; Thu, 15 Oct 2009 23:01:19 -0600 (MDT) Received: from brmea-mail-4.sun.com ([192.18.98.36]) by brm-avmta-1.central.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRL00535CM67BA0@brm-avmta-1.central.sun.com>; Thu, 15 Oct 2009 23:01:18 -0600 (MDT) Received: from fe-amer-10.sun.com ([192.18.109.80]) by brmea-mail-4.sun.com (8.13.6+Sun/8.12.9) with ESMTP id n9G51Hqe025094; Fri, 16 Oct 2009 05:01:17 +0000 (GMT) Received: from conversion-daemon.mail-amer.sun.com by mail-amer.sun.com (Sun Java(tm) System
Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRL00900CKO8300@mail-amer.sun.com>; Thu, 15 Oct 2009 23:01:17 -0600 (MDT) Received: from viaggio.local ([unknown] [69.226.240.14]) by mail-amer.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRL00HBLCM4UQE0@mail-amer.sun.com>; Thu, 15 Oct 2009 23:01:17 -0600 (MDT) Date: Thu, 15 Oct 2009 22:01:17 -0700 From: Scott Rotondo Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD7D9FB.1040100@sun.com> Sender: Scott.Rotondo@sun.com To: "Garrett D'Amore" Cc: Krishna Yenduri , Lori.Alt@sun.com, Nicolas Williams , James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD7FE1D.10801@sun.com> MIME-version: 1.0 Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD76748.4070306@Sun.COM> <4AD770D3.2060001@sun.com> <4AD7BFB1.4040008@sun.com> <4AD7D9FB.1040100@sun.com> User-Agent: Thunderbird 2.0.0.23 (Macintosh/20090812) Status: RO Content-Length: 1478 Garrett D'Amore wrote: > Scott Rotondo wrote: >> In any case, I think it's safe to conclude that SHA-256 is more than >> adequate for filesystem block equality comparisons. > > That's true today. At what point will Moore's law catch up though? > (In other words, how long will it take for storage densities to reach > the point where where the risk of a collision becomes significant?) > Start from a petabyte (probably about the largest practical filesystem > size in use today), and double every 12 months. (I think storage has > been outpacing Moore somewhat.) > To answer that question, consult the table Krishna provided: http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table First, select an acceptable collision probability. 
Let's choose 10^-18, which is the smallest probability found in the table, and (according to the same article) at the low end of the uncorrectable bit error rate for a typical hard disk. According to the table, SHA-256 can handle 4.8 x 10^29 (approx 2^98) blocks given our acceptable collision probability. That exceeds the ZFS limit of 2^64 *bytes* per filesystem. If we ignore the ZFS limit on filesystem size, and assume a disk block is 2K bytes, that's 2^59 petabytes. Your assumed rate of filesystem growth means we'll need a new plan in 60 years. Scott -- Scott Rotondo Principal Engineer, Solaris Security Technologies President, Trusted Computing Group Phone/FAX: +1 408 850 3655 (Internal x68278) From gdamore@sun.com Thu Oct 15 22:23:06 2009 Received: from sunmail5.uk.sun.com (sunmail5.UK.Sun.COM [129.156.85.165]) by sac.sfbay.sun.com (8.13.8+Sun/8.13.8) with ESMTP id n9G5N5nX013302 for ; Thu, 15 Oct 2009 22:23:05 -0700 (PDT) Received: from nwk-avmta-2.sfbay.sun.com (nwk-avmta-2.SFBay.Sun.COM [129.145.155.6]) by sunmail5.uk.sun.com (8.13.8+Sun/8.13.8/ENSMAIL,v2.2) with ESMTP id n9G5N2iu002786; Fri, 16 Oct 2009 06:23:04 +0100 (BST) Received: from pmxchannel-daemon.nwk-avmta-2.sfbay.sun.com by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) id <0KRL00G0HDMEUP00@nwk-avmta-2.sfbay.sun.com>; Thu, 15 Oct 2009 22:23:02 -0700 (PDT) Received: from sca-es-mail-2.sun.com ([192.18.43.133]) by nwk-avmta-2.sfbay.sun.com (Sun Java System Messaging Server 6.2-3.04 (built Jul 15 2005)) with ESMTP id <0KRL00D03DME7E60@nwk-avmta-2.sfbay.sun.com>; Thu, 15 Oct 2009 22:23:02 -0700 (PDT) Received: from fe-sfbay-10.sun.com ([192.18.43.129]) by sca-es-mail-2.sun.com (8.13.7+Sun/8.12.9) with ESMTP id n9G5N166027399; Thu, 15 Oct 2009 22:23:01 -0700 (PDT) Received: from conversion-daemon.fe-sfbay-10.sun.com by fe-sfbay-10.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) id <0KRL00M00DI7NL00@fe-sfbay-10.sun.com>; Thu, 15 
Oct 2009 22:23:01 -0700 (PDT) Received: from [192.168.251.11] ([unknown] [76.93.15.33]) by fe-sfbay-10.sun.com (Sun Java(tm) System Messaging Server 7u2-7.04 64bit (built Jul 2 2009)) with ESMTPSA id <0KRL008RADM9OR40@fe-sfbay-10.sun.com>; Thu, 15 Oct 2009 22:22:58 -0700 (PDT) Date: Thu, 15 Oct 2009 22:22:57 -0700 From: "Garrett D'Amore" Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009] In-reply-to: <4AD7FE1D.10801@sun.com> Sender: Garrett.Damore@sun.com To: Scott Rotondo Cc: Krishna Yenduri , Lori.Alt@sun.com, Nicolas Williams , James Carlson , Matthew Ahrens , PSARC-ext@sun.com, zfs-team@sun.com Message-id: <4AD80331.4020907@sun.com> MIME-version: 1.0 Content-type: text/plain; CHARSET=US-ASCII; format=flowed Content-transfer-encoding: 7BIT X-PMX-Version: 5.4.1.325704 References: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com> <4AD4CFD3.1020703@workingcode.com> <20091013193642.GK887@Sun.COM> <4AD76748.4070306@Sun.COM> <4AD770D3.2060001@sun.com> <4AD7BFB1.4040008@sun.com> <4AD7D9FB.1040100@sun.com> <4AD7FE1D.10801@sun.com> User-Agent: Thunderbird 2.0.0.22 (X11/20090909) Status: RO Content-Length: 1484 Scott Rotondo wrote: > Garrett D'Amore wrote: >> Scott Rotondo wrote: >>> In any case, I think it's safe to conclude that SHA-256 is more than >>> adequate for filesystem block equality comparisons. >> >> That's true today. At what point will Moore's law catch up >> though? (In other words, how long will it take for storage >> densities to reach the point where where the risk of a collision >> becomes significant?) Start from a petabyte (probably about the >> largest practical filesystem size in use today), and double every 12 >> months. (I think storage has been outpacing Moore somewhat.) >> > > To answer that question, consult the table Krishna provided: > http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table > > First, select an acceptable collision probability. 
> Let's choose 10^-18, which is the smallest probability found in the
> table, and (according to the same article) at the low end of the
> uncorrectable bit error rate for a typical hard disk.
>
> According to the table, SHA-256 can handle 4.8 x 10^29 (approx 2^98)
> blocks given our acceptable collision probability. That exceeds the
> ZFS limit of 2^64 *bytes* per filesystem.
>
> If we ignore the ZFS limit on filesystem size, and assume a disk block
> is 2K bytes, that's 2^59 petabytes. Your assumed rate of filesystem
> growth means we'll need a new plan in 60 years.

The fact that it exceeds the 2^64 limit is good enough for me. :-)

    -- Garrett

>
> Scott
>

From Darren.Moffat@sun.com Fri Oct 16 01:26:52 2009
Date: Fri, 16 Oct 2009 09:25:56 +0100
From: Darren J Moffat
Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009]
In-reply-to: <4AD7D9FB.1040100@sun.com>
To: "Garrett D'Amore"
Cc: Scott Rotondo, zfs-team@sun.com, Krishna Yenduri, Lori.Alt@sun.com,
    PSARC-ext@sun.com, Matthew Ahrens
Message-id: <4AD82E14.7000203@Sun.COM>

Garrett D'Amore wrote:
>> In any case, I think it's safe to conclude that SHA-256 is more than
>> adequate for filesystem block equality comparisons.
>
> That's true today. At what point will Moore's law catch up though?
> (In other words, how long will it take for storage densities to reach
> the point where the risk of a collision becomes significant?)
> Start from a petabyte (probably about the largest practical filesystem
> size in use today), and double every 12 months. (I think storage has
> been outpacing Moore somewhat.)

Which is why ZFS uses an extensible system for specifying checksum,
compression, and encryption algorithms. The NIST competition for the
SHA-3 set of digests is running now and there is expected to be a SHA-3
defined by 2012.
http://csrc.nist.gov/groups/ST/hash/timeline.html

--
Darren J Moffat

From Nicolas.Williams@sun.com Fri Oct 16 09:40:14 2009
Date: Fri, 16 Oct 2009 11:36:16 -0500
From: Nicolas Williams
Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009]
In-reply-to: <4AD7D9FB.1040100@sun.com>
To: "Garrett D'Amore"
Cc: Scott Rotondo, Krishna Yenduri, Lori.Alt@sun.com, James Carlson,
    Matthew Ahrens, PSARC-ext@sun.com, zfs-team@sun.com
Message-id: <20091016163615.GE892@Sun.COM>

On Thu, Oct 15, 2009 at 07:27:07PM -0700, Garrett D'Amore wrote:
> Scott Rotondo wrote:
> >Perhaps it's worth pointing out that both statements above are
> >correct, but they are answers to different questions. 10^-77 is the
> >probability of a hash collision for a particular pair of blocks. For
> >ZFS, we care if there is a collision between *any* pair of unequal
> >blocks. That probability depends on the number of blocks, as Krishna
> >points out. Finally, both of these calculations rely upon the implicit
> >assumption that the 2^256 possible hash values are uniformly
> >distributed; that assumption is widely accepted to be at least
> >approximately true, but I'm not aware of a mathematical proof.
> >
> >In any case, I think it's safe to conclude that SHA-256 is more than
> >adequate for filesystem block equality comparisons.
>
> That's true today. At what point will Moore's law catch up though?
> (In other words, how long will it take for storage densities to reach
> the point where the risk of a collision becomes significant?)
> Start from a petabyte (probably about the largest practical filesystem
> size in use today), and double every 12 months. (I think storage has
> been outpacing Moore somewhat.)

It's not. Brute forcing a security system with 128 bits of security,
and storing 2^128 bits runs into fundamental physical limits.

Still, if you have 2^48 bits of storage the likelihood of pair-wise
conflicts with a 256-bit hash is going to be more than 2^-128:
~ 2^-97 if we assume a block size of 128KB. 2^-97 is still extremely
unlikely. If we up the storage amount to 2^64 and block sizes to 1MB
we have a 2^-88 probability of collisions.
Still comfortable, but if SHA-256 turns out to have weaknesses, then
2^-88 begins to get uncomfortable. Of course, by the time anyone has
2^64 bits of storage we'll have switched to a larger hash function for
zfs send streams.

The problem for me is not that 128 bits is not enough -- it sure seems
like enough. One problem is that we don't know that SHA-256 has a
uniform distribution of outputs for any random set of inputs, but let's
assume that SHA-256 does.

The bigger problem for me is that ZFS had never before used checksums
for equality comparison, and I just wanted to make sure that the fact
that ZFS would now have one use case of checksums for equality
comparison didn't happen by accident. Since the i-team has indicated
that this design point is purposeful, I'm done.

Nico
--

From richard.matthews@sun.com Fri Oct 16 10:28:29 2009
Date: Fri, 16 Oct 2009 12:28:17 -0500
From: Rick Matthews
Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009]
In-reply-to: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com>
To: Matthew Ahrens
Cc: PSARC-ext@sun.com, zfs-team@sun.com
Reply-to: richard.matthews@sun.com
Message-id: <4AD8AD31.3010701@Sun.COM>

+1, although it may be implied from other responses.

On 10/13/09 01:25 PM, Matthew Ahrens wrote:
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>          ZFS send dedup
>     1.2. Name of Document Author/Supplier:
>          Author:  Lori Alt
>     1.3  Date of This Document:
>         13 October, 2009
> 4. Technical Description
> This case requests micro/patch binding; new interfaces are Committed.
>
>
>

--
---------------------------------------------------------------------
Rick Matthews                           email: Rick.Matthews@sun.com
Sun Microsystems, Inc.                  phone:+1(651) 554-1518
1270 Eagan Industrial Road              phone(internal): 54418
Suite 160                               fax:  +1(651) 554-1540
Eagan, MN 55121-1231 USA                main: +1(651) 554-1500
---------------------------------------------------------------------

From ahrens@sun.com Thu Oct 22 13:43:58 2009
Date: Thu, 22 Oct 2009 13:43:46 -0700
From: Matthew Ahrens
Subject: Re: ZFS send dedup [PSARC/2009/557 FastTrack timeout 10/21/2009]
In-reply-to: <200910131825.n9DIPSkK028907@localhost.sfbay.sun.com>
To: Matthew Ahrens
Cc: PSARC-ext@sun.com, zfs-team@sun.com
Message-id: <4AE0C402.1000308@sun.com>

This case was approved at yesterday's meeting.
--matt

Matthew Ahrens wrote:
> Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
> This information is Copyright 2009 Sun Microsystems
> 1. Introduction
>     1.1. Project/Component Working Name:
>          ZFS send dedup
>     1.2. Name of Document Author/Supplier:
>          Author:  Lori Alt
>     1.3  Date of This Document:
>         13 October, 2009
> 4. Technical Description
> This case requests micro/patch binding; new interfaces are Committed.
>
> 4. Technical Description
>
> OVERVIEW:
>
> "Dedup" is an overall term for technologies that eliminate duplicate
> copies of data in storage or memory. This specific application of
> dedup is for ZFS send streams, i.e., the output of the 'zfs send' command.
> For some kinds of data, much of the content of a send stream consists
> of blocks for which identical copies have already been sent earlier
> in the stream. This technology replaces later copies of a block with
> a reference to the earlier copy. This can significantly reduce the
> size of a send stream, which reduces the time it takes to transfer
> such a stream over a communication channel.
>
> PROPOSED SOLUTION:
>
> A new '-D' option to 'zfs send' is proposed. This option will cause
> dedup processing to be performed on the data being written to a send
> stream. Dedup processing is optional because it isn't always appropriate
> (some kinds of data have very little duplication) and it has significant
> costs: the checksumming required to detect duplicate blocks is
> CPU-intensive and the data that must be maintained while the stream
> is being processed can occupy a very large amount of memory.
>
> Duplicate blocks are detected by calculating a cryptographically strong
> checksum on each data block. Blocks that have the same checksum are
> presumed to be identical. The checksum type used at this time is SHA256.
> However, the stream format contains a field which identifies the checksum
> type, permitting other checksums to be used in the future.
>
> RELATION TO OTHER ZFS DEDUP WORK
>
> There are several other ongoing ZFS projects that are potentially
> related to this one: on-disk dedup, in-core dedup, and ZFS
> encryption (PSARC/2007/261). The relation between this project
> and the other projects is that over-the-wire (OTW) dedup does not depend
> on those projects, but will be able to take advantage of some
> aspects of the other dedup work when it is integrated.
>
> Dedup of send streams can be performed regardless of whether the
> other variants of dedup are operational. The main way that OTW dedup
> can take advantage of the other varieties of dedup support is that
> if a dedup-capable checksum of the data has already been calculated,
> the 'zfs send' processing will not recalculate it. It will use the
> already-computed checksum, thereby reducing the CPU usage of the
> stream dedup processing.
>
> The checksum of each block sent in dedup'ed streams will be included in
> the stream. This gives the receive side of the code the option
> to work with the in-core and on-disk dedup support to avoid the
> re-computation of the checksum when the data is stored in memory
> or on-disk. At this time, that option is not being used (because
> in-core and on-disk dedup are still in development), and it might
> not ever be used. But the interface has been designed in such a
> way to allow that optimization in the future.
>
> SEND STREAM FORMAT COMPATIBILITY IMPACT
>
> Over-the-wire dedup support requires a change to the format of
> a send stream. A new "write-by-reference" record is used to indicate
> a write operation that references data sent earlier in the stream.
>
> This new record type will only appear in dedup'ed streams. A feature
> flag indicating the use of dedup will be set in the stream's "begin"
> record. Older versions of 'zfs receive' will reject the stream as
> unreadable because of the presence of that feature flag.
> However, if dedup is not being done on the stream, older versions of
> the zfs software will be able to read the stream (assuming that the
> objects recorded in the stream are of a version that can be interpreted
> by the version of zfs on the receiving system, but that is an existing
> requirement, not one added by this project).
>
> CHANGES TO THE ZFS(1M) MANPAGE
>
> 65c62
> <      zfs send [-vR] [-[iI] snapshot] snapshot
> ---
> >      zfs send [-DvR] [-[iI] snapshot] snapshot
>
> 1746c1677
> <      zfs send [-vR] [-[iI] snapshot] snapshot
> ---
> >      zfs send [-DvR] [-[iI] snapshot] snapshot
> 1753a1685,1689
> >          -D
> >              Perform dedup processing on the stream. Dedup'ed streams
> >              cannot be received on systems that do not support the stream
> >              dedup feature.
> >
>
> ATTRIBUTES
>      See attributes(5) for descriptions of the following attributes:
>
>      ____________________________________________________________
>     |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
>     |_____________________________|_____________________________|
>     | Availability                | SUNWzfsu                    |
>     |_____________________________|_____________________________|
>     | Interface Stability         | Committed                   |
>     |_____________________________|_____________________________|
>
> 6. Resources and Schedule
>     6.4. Steering Committee requested information
>         6.4.1. Consolidation C-team Name:
>                 ON
>     6.5. ARC review type: FastTrack
>     6.6. ARC Exposure: open
>
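
The write-by-reference idea described in the proposal can be illustrated
with a small sketch. This is not the ZFS implementation and not the real
send stream format: the record names WRITE and WRITE_BYREF, the tuple
layout, and the functions dedup_stream and receive_stream are all
hypothetical, chosen only to show how a sender might replace a repeated
block with a reference to an earlier one (keyed by SHA-256, as the case
states) and how a receiver might resolve that reference.

```python
import hashlib


def dedup_stream(blocks):
    """Yield hypothetical stream records for a sequence of data blocks.

    First occurrence of a block:  ("WRITE", offset, checksum, data)
    Later identical block:        ("WRITE_BYREF", offset, checksum, ref_offset)

    Blocks with equal SHA-256 checksums are presumed identical, as in the
    proposal. The record layout here is an illustration, not the real format.
    """
    seen = {}  # checksum -> offset of the first block carrying that checksum
    offset = 0
    for data in blocks:
        checksum = hashlib.sha256(data).digest()
        if checksum in seen:
            # Duplicate: send a reference to the earlier copy, not the data.
            yield ("WRITE_BYREF", offset, checksum, seen[checksum])
        else:
            seen[checksum] = offset
            yield ("WRITE", offset, checksum, data)
        offset += len(data)


def receive_stream(records):
    """Rebuild the original byte sequence from the hypothetical records."""
    written = {}  # offset -> block data
    for kind, offset, _checksum, payload in records:
        if kind == "WRITE":
            written[offset] = payload
        else:
            # payload is the offset of the block sent earlier in the stream.
            written[offset] = written[payload]
    return b"".join(written[off] for off in sorted(written))


blocks = [b"aaaa", b"bbbb", b"aaaa"]
records = list(dedup_stream(blocks))
restored = receive_stream(records)
```

In this sketch the 'seen' table is the per-stream state that the proposal
warns can occupy a large amount of memory, and the checksum carried in each
record corresponds to the checksum the case says will be included in the
stream.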