
Gromit's Technical Guide to Partitions/Formatting/Data Recovery



[if you spot any errors, or care to suggest products or additions, email is gromit@sed8<period>com]
[also, many thanks to the person who Paypalled me some cash because this page saved his ass. You, sir, are a gentleman and a scholar.]

A lot of this will be operating system independent, but if it isn't it will most likely refer to Windows systems, as these are the most common out there in Userland.
In order to keep it simple, I've concentrated on DOS/FAT file systems rather than NTFS, but the general theories should be sound. Just be careful when I talk about file allocation tables and how files are deleted and recovered, as specifics may not be relevant to the file system you are interested in.

Hard drives


You all know what one of those looks like, so I'll not dwell on the physical aspects.
Each hard drive in your PC can have several partitions. With the popularity of larger sizes, multiple partitions are becoming more common.
A partition is a logical drive. That is, whilst you may only have one physical drive (one unit that you can physically hold in your hand), that drive may be broken up into several logical drives. Your operating system will see a number of drives named, for example, C: and D:, but they will all reside on the same physical hard drive.
Get used to the distinction between physical and logical stuff, as it is quite important.

So, how many partitions can you have?

The minimum number is one, of course, for a useable drive you can store stuff on.
However, you are free to create more partitions, limited by the capacity of the disk.
Rather than repeat what you can easily find on the web, there are details on drive partitioning at
http://www.pcguide.com/ref/hdd/file/structPartitions-c.html

When Windows comes to assign drive letters to your partitions, there is a process it follows:
A: and B: are reserved for floppy drives, and the active primary partition (only one can be active) on the first hard disk is assigned C:. Now the first primary partition on each additional hard disk gets the next letter, in sequence. Once the primary partitions on all hard disks have been assigned letters, all the logical partitions on the first hard disk take their turn, followed by all the logical partitions on the second disk, and so on.
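
As a rough illustration of that ordering, here is a minimal Python sketch. The data layout (a list of disks, each with "primary" and "logical" lists of made-up partition labels) is purely hypothetical, and the sketch only models the steps described above, not everything Windows actually does.

```python
def assign_drive_letters(disks):
    """Assign letters in the order described above.

    disks is a list of dicts with 'primary' and 'logical' lists of
    partition labels (hypothetical names). The first label in 'primary'
    is treated as that disk's active/first primary partition.
    Returns a dict mapping drive letter to partition label.
    """
    letters = iter("CDEFGHIJKLMNOPQRSTUVWXYZ")  # A and B are kept for floppies
    assigned = {}
    # Pass 1: the first primary partition on each disk, in disk order.
    for disk in disks:
        if disk["primary"]:
            assigned[next(letters)] = disk["primary"][0]
    # Pass 2: every logical partition, one disk at a time.
    for disk in disks:
        for label in disk["logical"]:
            assigned[next(letters)] = label
    return assigned


example = [
    {"primary": ["disk0-primary"], "logical": ["disk0-logical1", "disk0-logical2"]},
    {"primary": ["disk1-primary"], "logical": ["disk1-logical1"]},
]
for letter, label in assign_drive_letters(example).items():
    print(letter + ":", label)
```

Note how the primary partition on the second disk lands on D:, ahead of the logical partitions on the first disk.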

Partition Structure


Once you have a partition on a disk, you'll need to know the structure of it. For starters, the partition is split into two distinct logical sections - the data area and the system area.
The data area is where all your files are actually stored. All your operating system files, games, documents etc. (note that there may be some directory information stored here as well, depending on the file system in use.)
The system area is where things like the file allocation tables, directory entries and boot records are kept. If the files are small enough, some filesystems may actually store some of your data in this area. NTFS can do this, for example.
This system area also reduces the available capacity of the hard drive. We've all heard someone complain that their hard drive says 100GB on the label but once formatted, the disk appears much smaller. Well, apart from the gigabyte/gibibyte issue (the difference between 1GB being 1000 million bytes, as printed on the label, or 1024 * 1024 * 1024 = 1,073,741,824 bytes, as counted by the operating system) the file allocation tables take up a lot of space in order to track the disk contents. This is all capacity you can no longer use for your files.

The file allocation tables are a nice big list of the clusters that are allocated to each file. They are explained in the next section.

The master boot record (MBR - created using FDISK) is a small program plus some partition table data that is executed when the system boots up. Typically, the MBR resides on the first sector of the hard drive. This program begins the boot process by looking up the partition table to determine which partition to use for booting. It then transfers program control to the boot sector of that partition, which continues the boot process.
The partition table contains information detailing the structure of the partition(s) on a hard drive and resides in the Master Boot Record. The partition table itself is located at the offset 0x1BE of the first sector of the hard disk. There are four 16-byte entries in the table, each of them being a placeholder for the description of a partition on the hard disk. These entries indicate whether or not the partition is the boot partition, and some addressing values for the partition location, size and type. These entries can be recursive to allow further partitions (this is the process involved in Extended Partitions.)
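
To make the layout a bit more concrete, here is a small Python sketch that pulls the four entries out of a 512-byte copy of the first sector. The offset 0x1BE and the 16-byte entry size come from the text above. The order of the fields inside each entry (boot flag, CHS start, type, CHS end, starting sector, sector count) is the commonly documented MBR layout rather than something stated in this article, so treat it as an assumption. The function name is made up for this example.

```python
import struct


def parse_partition_table(sector):
    """Decode the four 16-byte partition entries of an MBR sector."""
    if len(sector) != 512:
        raise ValueError("expected a 512-byte sector")
    entries = []
    for slot in range(4):
        start = 0x1BE + 16 * slot
        raw = sector[start:start + 16]
        # boot flag, CHS start (3 bytes), type, CHS end (3 bytes),
        # first sector (LBA), number of sectors; all little-endian
        boot, _chs_first, ptype, _chs_last, first, count = struct.unpack(
            "<B3sB3sII", raw
        )
        if ptype == 0:  # a type of zero marks an unused slot
            continue
        entries.append({
            "slot": slot,
            "active": boot == 0x80,  # 0x80 flags the boot partition
            "type": ptype,
            "first_sector": first,
            "sectors": count,
        })
    return entries
```

You would feed it the first 512 bytes of a disk image file. Reading them from a saved image rather than a live drive keeps the experiment harmless.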

The volume boot record (VBR) is located on the first sector of every logical drive area allocated on the disk, in both the primary and extended partitions, and contains information about volume size, number of sectors, size of clusters, sectors per cluster and the name of the volume. It also contains code to start the operating system that this partition contains. This code is called by the MBR. The VBR is created by the standard FORMAT command.

Details of the FAT


First of all, let me make an apology. The FAT is often used to generically refer to the file allocation tables, volume boot record and directory structures etc. When I talk about it here, I have broken it down into just the tables themselves. The other stuff is mentioned separately. Keep this in mind if you are looking at other information elsewhere on this subject.

The file allocation table is extremely important. This table shows where all the files on the disk physically reside. Now, seeing as the files on your partitions can easily become fragmented through general use, the "address" of your files is not just one number. There is a huge linked-list for each file that shows which cluster(s) on the disk it uses.

A cluster is a collection of sectors, where a sector is (for all intents and purposes) the smallest section on a disk that you can access. You may have seen terms such as "8k clusters" or "4k clusters". This is simply stating that, for the given partition you have created, each cluster is either 8k or 4k in size, respectively. As each sector is generally 512 bytes in size, you can calculate how many sectors are in each cluster.
The actual size of each cluster (and thus the number of sectors it contains) is determined by the size of your partition and the file system it uses. If the operating system loses track of which clusters are assigned to which files, lost clusters can arise.

Let me give you an example on calculating cluster sizes:
Under FAT-16, the address size is 16 bits. This means the operating system can keep track of 2^16, or 65536 separate clusters. On a 2GB partition, the cluster size is then the partition capacity (2 * 1024 * 1024 * 1024 = 2147483648 bytes) divided by the number of clusters that can be addressed (65536). This gives us 32768 bytes, or 32k.
Under FAT-32, the cluster size is reduced as the address range has increased from 16 bits to 28 bits (not 32 bits, as FAT-32 reserves 4 bits for other things.) With an address space of 2^28, you end up with tiny clusters. Well, Windows uses 4k clusters by default on a FAT-32 partition of this size, so they don't really get as tiny as the math would lead you to believe.
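
The arithmetic above can be sketched in a few lines of Python. This is a simplification under stated assumptions: it assumes 512-byte sectors, only allows cluster sizes that are a power-of-two number of sectors, and ignores the handful of cluster numbers a real FAT reserves. The function name is made up for this example.

```python
def smallest_cluster_size(partition_bytes, address_bits, sector_bytes=512):
    """Smallest power-of-two cluster size that lets 2**address_bits
    clusters cover the whole partition."""
    clusters = 2 ** address_bits
    size = sector_bytes  # a cluster can't be smaller than one sector
    while size * clusters < partition_bytes:
        size *= 2
    return size


two_gb = 2 * 1024 * 1024 * 1024
print(smallest_cluster_size(two_gb, 16))  # FAT-16: 32768
print(smallest_cluster_size(two_gb, 28))  # FAT-32 by the math alone: 512
```

The second result is the "tiny cluster" the math suggests for FAT-32. In practice the formatting tool picks something larger.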
So when you go to save a file onto your hard drive, it is broken up into as many cluster-sized sections as needed. For example, a 70k file on that same 2GB partition formatted as FAT-32 with 4k clusters would take up 18 clusters. The first 17 clusters would be full (17 * 4k = 68k), and the final one will only be using 2k out of the available 4k. The remaining 2k of this cluster is just wasted. Nothing can be stored in it, as there is no way for the file system to address inside the cluster. Because of this wasted space, you may find (and this is usually the case) that your files take up more space on your hard drive than you think.
You can see from this how FAT-16 will waste more space on your drive. That 70k will only use up 3 of those 32k clusters, but the waste of the last cluster will be 26k. So on the FAT-16 partition, your 70k file will use up as much space as a 96k file.
Particular file sizes will be more wasteful than others, but it is fully determined by the cluster size, which is in turn determined by the partition size and file system used.
Take a look at the properties of some files on your computer and you'll see they have a size and a size on disk. The first is the number of bytes in the file, whereas the second takes account of the whole number of clusters it requires. Files smaller than your cluster size will all take up one whole cluster on your disk.
For example, the shortcut to Unreal Tournament 2004 on my computer is listed as being 1427 bytes in size. The size on disk, however, is 4096 bytes as my clusters are 4k in size. On this partition, no file can take up less than 4096 bytes. If I had an absolute ton of these little files that I was only keeping for archival purposes, I would seriously consider storing them inside a container format such as a ZIP file. Even without compression, being able to store them all in one lump would allow me to recover a large amount of that wasted cluster space.
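
If you want to check the numbers in this section yourself, the following Python sketch works out the clusters used, the size on disk and the wasted space for a given file size and cluster size. The function name is made up for this example.

```python
def size_on_disk(file_bytes, cluster_bytes):
    """Return (clusters used, bytes allocated, bytes wasted)."""
    clusters = -(-file_bytes // cluster_bytes)  # divide, rounding up
    allocated = clusters * cluster_bytes
    return clusters, allocated, allocated - file_bytes


print(size_on_disk(70 * 1024, 4 * 1024))   # the 70k file, 4k clusters
print(size_on_disk(70 * 1024, 32 * 1024))  # the 70k file, 32k clusters
print(size_on_disk(1427, 4096))            # the 1427-byte shortcut
```

The three calls give 18 clusters with 2048 bytes wasted, 3 clusters with 26624 bytes (26k) wasted, and 1 cluster of 4096 bytes respectively, matching the figures above.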

There are 2 file allocation tables, known as FAT1 and FAT2. FAT2 is a backup of FAT1 and can be used to recover information if FAT1 becomes damaged in some way. I'm yet to find a nice easy tool that will show you the two FATs and let you manipulate and/or copy them over one another, but such a thing may well exist (TestDisk, as listed at the end of this document, might do it but I haven't tested it.) Failing that, you can use a hex editor - the Internet will give you details on where the two FATs reside if you Google for it. Be warned that this can be a little dangerous. Any time you start hacking away at your file tables with a hex editor is a time where you can do some serious damage. Obviously if you ARE going to be doing this, you'd really want to be doing this to a slave drive rather than the drive all your tools and OS are running from.
Note, also, that you may really want to make a good FAT out of undamaged parts of both FATs. Blindly dumping one FAT over the other is not a good idea without checking which you want to keep, or which parts. You really need to analyse their contents before going ahead and overwriting one of your FATs.
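
As a sketch of what "analysing their contents" could look like, the Python below lists every entry that differs between two FAT copies. It assumes you have already copied both tables out of a disk image as raw bytes (finding where they live is not covered here), and that the partition is FAT-16, so each entry is a 16-bit little-endian number. The function name is made up for this example, and it only reads data. It never writes anything back.

```python
import struct


def compare_fats(fat1, fat2):
    """Return (entry index, value in FAT1, value in FAT2) for every
    16-bit entry that differs between the two copies."""
    if len(fat1) != len(fat2) or len(fat1) % 2:
        raise ValueError("FAT copies must be the same, even, length")
    fmt = "<%dH" % (len(fat1) // 2)
    first = struct.unpack(fmt, fat1)
    second = struct.unpack(fmt, fat2)
    return [(i, a, b) for i, (a, b) in enumerate(zip(first, second)) if a != b]
```

An empty list means the two copies agree. Anything else is a list of entries to inspect by hand before deciding which copy to trust.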

Formatting


There are a variety of formatting types available to the user, some of which require the use of third-party tools.

High-level format - this is your standard Windows format, and it comes in two different flavours. The quick format simply resets the system area on the disk. The full format resets the system area and also tests the data area for bad sectors. Note that neither of these formats alters anything in the data area of the disk.
For floppy disks, things are a little different. The quick format is the same, but the full format on a floppy disk will actually overwrite the disk with the character 0xF6 prior to laying down the system area.

Mid-level format - this requires a third-party tool and will overwrite every byte on the disk with a particular character or sequence of characters. Zero is a popular choice, and this is why this format level is often referred to as "zeroing the drive".
Note that this format is often erroneously called a low-level format.
On this subject, a good choice for a mid-level format is something like Darik's Boot and Nuke (DBAN). Download it for free from http://dban.sourceforge.net/ and do a single pass over your target disk. As you can read in the following paragraphs, a single wipe pass is enough to wipe your data for good.

Low-level format - this sets the interleave factor (Google for it - it's not important for you to know) and prepares the disk for a particular type of disk controller. This is generally performed at the factory as it defines the tracks and sectors on the drive. It is the process of outlining the positions of the tracks and sectors on the hard disk and writing the control structures that define where the tracks and sectors are.
Low-level formats were performed on the old MFM specification drives but a user will rarely, if ever, perform this function on an IDE drive. Doing so could render the drive either inoperable or very slow.
Most drive manufacturers have their own utilities available to perform these functions.

MFM drives are pretty old, and usually come in at under 120MB in size. You can usually spot one by noticing that the communications cable is split into two, rather than the single ribbon cable as seen on IDE drives. Pray you never have to work on one of these.

From these descriptions you should be able to see that if only a high-level format has been performed (either quick or full) then your data has remained untouched.
Now, there are some people out there who claim that data can be recovered even from a mid-level format.
A popular paper on this topic is written by Gutmann, and can be found here:
http://www.cs.auckland.ac.nz/~pgut001/pubs/secure_del.html

This document has been examined and a rather good rebuttal on it can be found here:
http://www.nber.org/sys-admin/overwritten-data-guttman.html
I am of the opinion that the arguments put forward by the gentleman in the rebuttal document are accurate. I have yet to locate anyone on the planet who is capable of recovering useful data that has been overwritten. Having said that, there are a number of people in the field who have successfully recovered overwritten data under certain limiting conditions.
Firstly, the person has to know the nature of the data to begin with. I must admit that I'm not certain as to what extent this familiarity must be, but it sounds to me like you must know what the data is in order to make a determination on what it should be when recovered.
Secondly, the process is very slow - of the order of around 1 kilobyte per hour. Work out how long that would take for you to recover all your mp3 files off a 120GB drive.
Finally, this process is only capable of being performed on low-density drives, such as the MFM encoded drives mentioned earlier.
Professor Gomez at the University of Maryland in the US seems to be the leading expert in this area, and he got a mention in New Scientist magazine some years ago. The limitations mentioned earlier pertain to work he has done with scanning tunnelling microscopes and/or magnetic force microscopes (that's another MFM acronym that is often confused with the elderly drive type. For drives, it stands for "modified frequency modulation".)

These issues mean that the recovery of overwritten data is impossible in the real world.

[note: since I wrote this article it has been brought to my attention that Gutmann has added an epilogue to his report. It basically covers the fact that his analyses were based on older drive technology. He still believes multiple passes are needed for wiping, but tends to put it down to 'a few' rather than 35. He ends with stating 'the chances of an adversary being able to find the erased traces of [some small amount of data] in [, say,] 80GB of other erased traces are close to zero.']

Now some of you are asking "if you can't recover data that has been overwritten just once, why do companies sell software that does multiple overwrites?"
I have an opinion on this, but I can't back it up with any facts. Here it is anyway:
Company A brings out DataDeathstar, a program that will eradicate your rebel files by overwriting them once. This is all you need.
Company B makes a similar product, perhaps without such a copyright-infringing name, but in order to sound better than Company A, they claim they can do multi-pass overwrites. Perhaps they back this decision up with the Gutmann article mentioned earlier.
Now if the cost is the same, Joe User will choose the program with more features - the version that does multi-pass overwrites.
This then precipitates an escalation in the number of wipes any package will perform, to make them sound better than their competitors. Eventually we end up with the 35-pass Gutmann wipe, the multi-pass Department of Defense "standard", or the Bilbo-level Eleventy-billion Insano-wipe.

So why does the Department of Defense specify that huge multi-pass overwrite if one is enough? Once again I can only theorise, as I don't know anyone in that industry who could speak about this topic. Here goes:
Decisions are made by people far above the technical guys on the ground. That is, management types with no techie knowhow. I'm not berating this issue, as it is the same the world over.
At the weekly meeting, one of the subordinate guys points out he read a report from Gutmann about recovering data. It may have mentioned the MFM-issue but that's all techie-speak. The boss decides that he'd rather not risk his career on an issue he can't understand and doesn't have the resources to examine in any depth.
To be safe, he makes sure the standard is some huge amount of overkill, so he can never be determined to be a traitor by allowing data to get into the wrong hands.
This all seems fairly reasonable to me - everyone errs on the side of caution in a field they don't understand.
Also, the military has had loads of data on old MFM technology in their time, and recovery MAY be possible on this gear. Why make multiple standards for different types of drives when your staff may not be able to tell the difference between them?
They also have plenty of manpower, and would be quite happy letting some guys spend their days just wiping data, whether it's a waste of time or not.

Just remember one thing - one overwrite pass is enough to stop anyone recovering your data. If anyone tells you otherwise, tell them to put up or shut up. It's quite simple to get a floppy disk (or hard disk if they prefer), put some files on it and then wipe them so that they can be recovered with some magical system this person says exists. Make it easy for them and tell them what the file types are if you like - it won't help.
There is just too much money to be made in the private sector if some firm were capable of doing overwrite recovery - you would have heard of it being done if it were possible. People often state that perhaps the NSA or US military can do it but aren't telling anyone. Well, those particular organisations outsource all their data recovery to a private company, so the services would be offered to anyone at the right price.

Deleting and Data Recovery


When you delete a file, the only changes that are made are to the system area of the disk - your data remains untouched in the data area.

The system area holds a directory entry for each file. This entry differs slightly between DOS (FAT-12/FAT-16) and Windows (VFAT/FAT-32), so I'll list the details of the Windows entry here. In either case, the entry is 32 bytes in size as follows:

Filename = 8 bytes, extension = 3 bytes, attributes = 1 byte, NT = 1 byte
Creation time = 3 bytes, creation date = 2 bytes, last access date = 2 bytes
High 16 bits of starting cluster (only 12 of them are used) = 2 bytes, time of last update = 2 bytes
Date of last update = 2 bytes, low 16 bits of starting cluster = 2 bytes
File size = 4 bytes
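
For the curious, here is a minimal Python sketch that decodes one of these 32-byte entries using the field sizes listed above. It assumes the values are stored little-endian and that the 3-byte creation time is one byte of fine resolution followed by a 2-byte time, which is how the format is commonly documented rather than something stated in this article. The names used are made up for this example.

```python
import struct

# name, extension, attributes, NT byte, creation time (1 + 2 bytes),
# creation date, last access date, high cluster word, update time,
# update date, low cluster word, file size
ENTRY = struct.Struct("<8s3sBBBHHHHHHHI")


def parse_dir_entry(raw):
    """Decode a single 32-byte short-name directory entry."""
    (name, ext, attr, _nt, _ctime_fine, _ctime, _cdate, _adate,
     cluster_hi, _mtime, _mdate, cluster_lo, size) = ENTRY.unpack(raw)
    deleted = name[0] == 0xE5  # 0xE5 in the first byte marks a deleted file
    if deleted:
        name = b"?" + name[1:]  # the real first character is gone
    return {
        "name": name.decode("ascii", "replace").rstrip(),
        "ext": ext.decode("ascii", "replace").rstrip(),
        "attributes": attr,
        "deleted": deleted,
        "start_cluster": (cluster_hi << 16) | cluster_lo,
        "size": size,
    }
```

The starting cluster and file size are the two values that matter most for recovery, as the rest of this section explains.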


The byte marked "NT" is used by Windows NT to keep track of whether the file name is upper or lower case. I've no idea why they need a special byte to do this, to be honest.

As well as this directory entry, remember that you also have a linked-list of the clusters that contain the file data.

So, when a file is deleted the filename is located in the directory entry and the leading character in the name is changed to the character 0xE5. This character indicates to the system that the directory entry is available for use by a new entry. No other change is made to this 32-byte entry.
The FAT entry is then zeroed out to indicate that this cluster is available for use. Because it's a linked-list, this means all the other clusters become free for use, too.
There are some added steps if the filename is bigger than the standard DOS 8.3 limitation.

Now, if you want to recover this deleted file then you need to do a number of things.
Firstly, locate the file name in the directory entry and alter that first character from 0xE5 to some other legal value. This is why, when you run an undelete tool, the first character of your recovered files is often changed to something like an underscore. The undelete program has no idea what the true first character should be and so replaces it with some other character. It can, however, determine the correct starting character if the filename is longer than 8.3.
After this, the linked-list of used clusters for that file needs to be chained back together.
Note that there are a few more steps if the file has a name longer than DOS 8.3

Now, let's take a closer look at that phrase "chained back together". The FAT contains a list of all the clusters in use by any particular file. This list of clusters may be very large for big files or a single entry if the file is small enough to fit inside one cluster.
If your partition does not suffer from any sort of fragmentation, then these cluster lists are likely to be sequential in content. That is, the list of used clusters will progress in a linear fashion (for example - 300, 301, 302, 303, ..., 400)
However, if your files are fragmented then the clusters used may jump all over the drive. For file recovery, this is a very bad thing.
You see, the directory entry contains details on the first cluster used by the file. There are two entries in it that track this for you. And when you delete a file this entry remains untouched. So we can look in the directory entry and check out the starting cluster but the FAT has been wiped for this file, so we can't see which clusters are used beyond this first one.
We know the size of the file (in the directory entry) and so can tell how many clusters are in use. Unfortunately, we can only hope that the clusters are sequential and start chaining them together.
If the file is fragmented, then the clusters will NOT be sequential and we will have great difficulty in working out the correct order for the links to be chained back together. Incorrect chaining results in your files starting out as one thing and ending up another.
There may well be algorithms that can be used to determine the likelihood of any given cluster being the correct follow-on from another, but this is probably the sort of thing that makers of data recovery tools keep to themselves.
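
The "hope the clusters are sequential" approach can be sketched in Python as follows. This is only an illustration of the idea, working on a FAT held in memory as a plain list of numbers. It is not how any particular undelete tool works, the function names are made up, and the 0xFFFF end marker assumes FAT-16.

```python
def guess_cluster_chain(fat, start_cluster, file_bytes, cluster_bytes):
    """Guess the clusters of a deleted file, assuming it was not fragmented.

    Returns the list of clusters, or None if any of them is off the end
    of the table or is already in use by another file.
    """
    needed = -(-file_bytes // cluster_bytes)  # divide, rounding up
    chain = list(range(start_cluster, start_cluster + needed))
    if any(c >= len(fat) or fat[c] != 0 for c in chain):
        return None
    return chain


def relink(fat, chain, end_marker=0xFFFF):
    """Chain the guessed clusters back together in the FAT (in place)."""
    if not chain:
        return
    for here, following in zip(chain, chain[1:]):
        fat[here] = following
    fat[chain[-1]] = end_marker


fat = [0] * 20
chain = guess_cluster_chain(fat, 5, 10000, 4096)  # 10000 bytes needs 3 clusters
relink(fat, chain)
print(chain, fat[5:8])
```

If the file really was fragmented, the guess will quietly pick up clusters belonging to something else, which is exactly the "starts out as one thing and ends up another" problem described above.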

Extra details on the cluster chains:
The FAT exists as a mapping table to show which clusters are in use for a particular file. Clusters in the FAT are marked as either:
0 - unused
<num> - this number indicates the cluster is in use and it shows the next cluster that the file continues onto. For example, if it is "306" it means that cluster number 306 is the next cluster in the chain.
FFFF - this cluster is the last cluster in the file or the only cluster in a file (on FAT-16, any value from FFF8 to FFFF means the same thing)
FFF7 - this cluster is marked BAD and will not be used.
So you have a table of numbers, and you can follow this chain of numbers along to see what clusters are used by the file in question.
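
Following the chain is simple enough to sketch in Python. The marker values below are the commonly documented FAT-16 ones (0 for free, FFF7 for bad, FFF8 to FFFF for end of file) and are an assumption as far as this article goes. The FAT is again just a list of numbers in memory, and the function name is made up for this example.

```python
FREE = 0x0000
BAD = 0xFFF7
END_OF_CHAIN = 0xFFF8  # any value from here up to 0xFFFF ends a file


def follow_chain(fat, start):
    """Return the list of clusters used by the file starting at 'start'."""
    chain = []
    seen = set()
    cluster = start
    while True:
        if cluster < 2 or cluster >= len(fat):
            raise ValueError("cluster %d is outside the table" % cluster)
        if cluster in seen:
            raise ValueError("chain loops back on itself")
        seen.add(cluster)
        chain.append(cluster)
        following = fat[cluster]
        if following >= END_OF_CHAIN:
            return chain
        if following in (FREE, BAD):
            raise ValueError("chain is broken at cluster %d" % cluster)
        cluster = following


fat = [0] * 12
fat[2], fat[3], fat[7] = 3, 7, 0xFFFF  # a fragmented three-cluster file
print(follow_chain(fat, 2))
```

The loop and range checks are there because a damaged FAT can easily send you round in circles or off the end of the table.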

Raw Data Recovery


Another method of data recovery avoids the FAT and directory entries entirely. It also relies heavily on the files not being fragmented.
Say you have a JPG image that has been deleted, and for some reason you don't have any directory entries available. For example, your system area may have destroyed itself and your directory entries and FAT are now just so much random data. What can you do?
Well, you can't look for any starting cluster entries or file names/sizes to help you out so instead you have to work on the raw data of the drive.
JPG image files come in a variety of flavours, including EXIF (which is a JPG format that includes metadata.)
Now, all JPG images (or more correctly, JFIF images) have a unique header in them that tells you they are an image file of this format.
The standard JFIF file headers are as follows:

File Type       Header in Hexadecimal     Notes
Standard JPG    FFD8FFE0nnnn4A464946      nnnn is the length of the first header segment ("JFIF")
EXIF JPG        FFD8FFE1nnnn45786966      nnnn is the length of the first header segment ("Exif")

As you can see, both types of JPG start with the same bytes FFD8FF. So if we wanted to scan the entire hard drive for any and all JPG images, we can tell a piece of software to scan for any occurrences of the string of hexadecimal characters FFD8FF. Note that the "nnnn" value in the table above only gives the length of that first header segment, not the size of the whole file, so it can't tell us how much data to copy. Instead, once the header has been located we copy out everything from there up to the JPG end-of-image marker (FFD9), or up to some sensible maximum size if no end marker turns up. This data is then saved as a binary file on your destination hard drive. If the file was not fragmented, we have successfully "carved out" an image file from our dead drive.
If you have recovered some images before, you may have noticed that some JPGs had only appeared as half an image, or with blocky bands of weird colour in them. This indicates that the file was fragmented in some way and the recovery process therefore missed the other clusters that were elsewhere on the drive.
Some other file types also have unique headers and footers, and we can do the same thing for them: carve out the data found between the header and footer.
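
Here is a bare-bones Python sketch of header-to-footer carving for JPGs. It assumes the raw data has already been read into memory (from a disk image, say), and it treats the bytes FFD9 as the JPG footer, which is the standard end-of-image marker rather than something taken from the table above. The function name and the 10MB size cap are made up for this example. Real carving tools are a good deal smarter than this.

```python
def carve_jpegs(data, max_bytes=10 * 1024 * 1024):
    """Return every run of bytes that starts with FFD8FF and ends with FFD9."""
    header = b"\xff\xd8\xff"
    footer = b"\xff\xd9"
    found = []
    pos = 0
    while True:
        start = data.find(header, pos)
        if start == -1:
            break
        # look for the footer, but give up after max_bytes
        end = data.find(footer, start + len(header), start + max_bytes)
        if end == -1:
            pos = start + len(header)
            continue
        found.append(data[start:end + len(footer)])
        pos = end + len(footer)
    return found
```

One known weakness: EXIF images often contain an embedded thumbnail with its own header and footer, so stopping at the first FFD9 can cut the main image short. As with everything in this section, it also falls over as soon as the file is fragmented.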

So, the likelihood of file recovery is dependent upon a number of conditions:
Firstly, you want the partition to have as little fragmentation as possible. This will increase the likelihood that all the used clusters are sequential. This may be reason enough for you to schedule a regular defrag on your drive.
Secondly, you want the drive to be used as little as possible after the data loss, preferably not at all. The more you move files around or create data, the greater the likelihood that you will overwrite some of the data you want to recover. It doesn't matter how much you use the drive beforehand, but make sure you stop using it once you've suffered a loss of data.
This is as good a reason as any to have multiple partitions on your hard drives. With a separate partition for all your data, you can have your programs on one partition and even install recovery software there if you need it, without affecting the data drive you want to recover from.
You will also need a partition to save any recovered data to, and this can be that same partition you installed the recovery tools on. Just make sure nothing is saved to the partition you want to recover from, as doing so could overwrite the very data you're trying to salvage.

List of Data Recovery Tools


Here's a handy list of data recovery tools that have been recommended by people:
(Details/prices correct as of January 18, 2006)

http://www.stellarinfo.com/disk-recovery.htm - Stellar Phoenix recovery software. Separate product for FAT/NTFS, Novell, Linux, CDs, Mac, Flash media. Demo only shows what is recoverable. Cost is from US$79 upwards.
http://www.darepc.com/ - Data Recovery Service - SA-supported and very competitively priced.
http://support.microsoft.com/?kbid=153973 - Microsoft KB article on Recovering NTFS Boot Sector on NTFS Partitions.
http://www.alsoft.com/DiskWarrior/ - Mac disk repair. Cost is US$80. No demo available.
http://www.prosofteng.com/products/index.php - PC and Mac recovery software. Cost is US$99 and up. Mac demo only recovers one file per session. PC demo only shows files that can be recovered.
http://www.ontrack.com/easyrecoveryprofessional/ - OnTrack EasyRecovery Basic and Professional. Demo version lets you see what is recoverable, and will repair ZIP files. Cost is US$89-$200, with commercial pricing up to US$1500.
http://www.lexar.com/software/image_rescue.html - Lexar ImageRescue software for flash media on Mac or PC. Cost is US$30.
http://www.runtime.org/ - GetDataBack for FAT or NTFS and even a RAID Reconstructor. Cost is US$69-$99, with savings if you bundle different filesystem versions. Demo allows you to test file integrity before you need to purchase a licence to recover the files.
http://www.pcinspector.de/file_recovery/UK/welcome.htm - PC Inspector for FAT & NTFS. Freeware. Also have Smart Recovery software for flash media devices.
http://www.smart-projects.net/isobuster/ - ISObuster - CD/DVD data recovery. Free for almost all functionality, and US$26 to register for extra stuff.
http://www.cdroller.com/ - CDroller - CD/DVD data recovery. 14-day evaluation version available. Full version cost is US$30 for 3-year licence.
http://www.handyrecovery.com/index.shtml - Handy Recovery for FAT/NTFS. 30-day trial version available but only recovers 1 file per day. Full version cost is US$30.
http://www.jufsoft.com/badcopy/ - BadCopy Pro - removable media data recovery. Cost is US$40. Offers your money back if it can't recover your files (rules apply.) Trial version will not recover files.
http://www.active-undelete.com/ - Active@ Undelete for FAT/NTFS. RAID support with Enterprise edition. Demo version will only recover files up to 64k in size. Cost is US$40.
http://www.file-recovery.net/ - Active@ File Recovery for FAT/NTFS. Can be run from a floppy disk. Demo available but it only recovers up to 32k in file size. Home (personal) version costs US$30 and Professional (ie, business licence) is US$49. Site licences available.
http://www.bitmart.net/r2k.shtml - Restorer2000 for FAT/NTFS. Cost is US$30-$50. Trial version only recovers files up to 64k in size.
http://www.r-tt.com/ - R-Studio for FAT/NTFS/Ext2FS. Also does damaged RAID recovery. Cost is US$50-$180. Trial version only recovers files up to 64k in size.
http://www.diydatarecovery.nl/irecover.htm - iRecover for FAT/NTFS. Cost is US$30-$80, discounts apply for multiple filesystem bundles. Demo allows recovery of one full directory only.
http://www.cgsecurity.org/index.html?testdisk.html - Tool to check and undelete partitions. Freeware. FAT12,  FAT16, FAT32, Linux EXT2/EXT3, Linux SWAP (version 1 and 2), NTFS, BeFS (BeOS), UFS (BSD), NSS (Netware), ReiserFS. Runs under Windows or Linux.
http://www.geocities.jp/br_kato/ - Restoration. Freeware. Requires no installation and will run from a floppy disk. Used to recover and/or wipe files.
http://www.acronis.com/homecomputing/products/diskdirector/ - Acronis Disk Director Suite. Cost is US$50. Full suite contains a partition manager, boot manager, disk editor and partition recovery tool. Supports FAT16, FAT32, NTFS, HPFS, Linux Ext2, Ext3, ReiserFS, and Linux Swap. Demo version only lets you see what it can do.
http://www.z-a-recovery.com/digital_image_recovery.htm - Zero Assumption Digital Image Recovery. Freeware. Specifically designed to recover image files from digital camera memory cards.
http://www.quetek.com/ - File Scavenger. NTFS and FAT. "Hard disks, floppy disks, ZIP disks, memory sticks, flash cards, RAIDs, and more." Personal use is US$45, or US$85 if you want to handle RAID-0. Professional use gets at RAID-5 and spanned volumes for US$169. Demo version only recovers files up to 64k in size.
http://foremost.sourceforge.net/ - recovers deleted files from the filesystem. Freeware. "...a console program to recover files based on their headers, footers, and internal data structures. This process is commonly referred to as data carving. Foremost can work on image files, such as those generated by dd, Safeback, Encase, etc, or directly on a drive." Windows binary and unix tar.gz