May 08, 2008 Archives

Thu May 8 16:35:13 CEST 2008

Gzip, Bzip2 and Lzma compared

Why?

There has recently been a discussion about GNU switching from bzip2 to lzma for their distributed tarballs. They still offer gzip tarballs as an alternative. Gentoo, however, has preferred the bzip2 tarballs, mostly due to bzip2's better pack ratio.

Unfortunately, the software for lzma is not (yet) as mature as some would like. For example, the format of the files it produces has changed recently (in a compatible way, though). Also, the current incarnation of the canonical binaries (lzma-utils) links against libstdc++.so by default, which is a huge headache for release engineering and the like.

What and How?

How these distribution problems can or will be solved remains to be seen. What I'm more interested in is a comparison of the performance of the three packers. I had initially hoped to also compare the amount of I/O done and memory usage, but GNU time let me down there.

GNU time's manpage claims that it can record and output quite a few figures regarding I/O and memory usage. Unfortunately, I have not been able to make time report anything other than 0 for those interesting stats. Not wanting to debug time, I've chosen to test pack ratio and execution time instead.
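For reference, this is roughly the kind of invocation I mean; the format string is just an example, run against one of my test files:

    # GNU time (the binary, not the shell builtin) with a custom format:
    # %e = elapsed seconds, %M = max resident set size (KiB),
    # %I/%O = file system inputs/outputs. The last three always came back as 0.
    /usr/bin/time -f "elapsed: %e s, max RSS: %M, fs in: %I, fs out: %O" \
        gzip -c linux-2.6.25.tar > /dev/null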

I've run all tests in single user mode, so as not to disturb page caches and the like. All tests read from the page cache, to factor out disk latency and possible fragmentation of files.
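A minimal sketch of how such a run can be scripted; the file name and output locations are made up for illustration, and I'm assuming the gzip-style options of lzma-utils:

    #!/bin/sh
    # Warm the page cache so disk latency doesn't skew the timings.
    cat linux-2.6.25.tar > /dev/null

    for packer in gzip bzip2 lzma; do
        # Time compression to a file, then decompression of that file.
        /usr/bin/time -f "$packer pack:   %e s" \
            $packer -c linux-2.6.25.tar > packed.$packer
        /usr/bin/time -f "$packer unpack: %e s" \
            $packer -dc packed.$packer > /dev/null
    done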

Here's the data I used:

  1. lenna.tiff The compression testing poster girl
  2. lenna.png The same image as above, as a PNG, to see what the algorithms make of already-compressed data
  3. linux-2.6.25.tar The sources for the linux kernel, version 2.6.25
  4. wrnpc12.txt Leo Tolstoy: War and Peace. From Project Gutenberg. ASCII Plaintext.
  5. ImageMagick-6.4.0_modules-Q16.tar A tar file of /usr/lib64/ImageMagick-6.4.0/modules-Q16/ from my amd64 system, containing many .so, .la and .a files

Results

All times are in seconds; Time (C) is compression, Time (D) is decompression. The compression ratio is the compressed size as a percentage of the uncompressed size, so smaller is better.

Input data           Algo    Time (C)  Time (D)    Ratio
ImageMagick modules  gzip        0.30      0.04   25.28%
                     bzip2       0.61      0.27   20.44%
                     lzma        4.97      0.10   12.64%

Lena (TIFF)          gzip        0.05      0.01   93.24%
                     bzip2       0.15      0.08   74.31%
                     lzma        0.33      0.06   80.69%

Lena (PNG)           gzip        0.02      0.00  100.02%
                     bzip2       0.13      0.05  100.32%
                     lzma        0.18      0.04  100.67%

Linux Sources        gzip       12.98      1.88   21.81%
                     bzip2      54.61     14.25   17.07%
                     lzma      287.58      5.46   14.47%

War and Peace        gzip        0.30      0.03   37.00%
                     bzip2       0.58      0.26   26.96%
                     lzma        3.30      0.09   28.34%

Overall (Average)    gzip        2.73      0.39   55.47%
                     bzip2      11.22      2.98   47.82%
                     lzma       59.27      1.15   47.36%

Mostly, this benchmark turns out as expected (and as advertised by the lzma authors). Gzip is fastest at compression, bzip2 comes second, and lzma finishes last. Decompression is different: gzip is fastest again, but lzma is faster than bzip2, sometimes by nearly a factor of three. Compression ratio is where gzip shows its age: its ratio is worst (except when packing already-compressed data). Bzip2 bests lzma's compression ratio in three cases of five (two of four if you don't count the PNG). On average, however, lzma compresses better by a hair. Lzma's strongest lead over the others is on the ImageMagick module tar.

Conclusion

Lzma is definitely worth a look if you unpack much more often than you pack: compression is slow (about five times slower on average than the already-not-so-quick bzip2!), but decompression is quick. Also, lzma works better on large files. From a distribution standpoint, if and when the library and dependency problems have been sorted out, lzma is quite preferable if you don't do much compression yourself but have to watch your bandwidth usage. For Gentoo, it's reasonable, provided the kinks are worked out. Another disadvantage of lzma is that while gzip and bzip2 each have their own single-character decompression switch when using tar (z and j, respectively), the new guy on the block only gets a long one: --lzma (introduced in tar 1.20). This might sound minor, but it can quickly get on your nerves. YMMV.
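To illustrate the switch situation (tar 1.20 or newer for the lzma case; archive names are just examples):

    tar xzf foo.tar.gz            # gzip: single-letter switch
    tar xjf foo.tar.bz2           # bzip2: single-letter switch
    tar --lzma -xf foo.tar.lzma   # lzma: long option only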

What else to test

I'd love to see more tests with real-life data. I also think a comparison of I/O load and memory usage for the three contenders would be interesting. If somebody wants to do all the work, comparing the different speed/compression settings for the three could be interesting, too; a sketch of what such a sweep might look like follows. I have only used the default settings, which might not be entirely fair to gzip, which was written at a time when CPU cycles were far more expensive.
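A rough sketch, assuming all three packers accept the usual -1 through -9 presets; the file and output names are made up:

    #!/bin/sh
    # Sweep a few preset levels for each packer and record the timings.
    f=linux-2.6.25.tar
    for level in 1 6 9; do
        for packer in gzip bzip2 lzma; do
            /usr/bin/time -f "$packer -$level: %e s" \
                $packer -$level -c "$f" > out.$packer.$level
        done
    done
    ls -l out.*   # compare the resulting sizes by hand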

Comments

Josh "Nightmorph" Saddler writes:

Thanks for the blog and the benchies. Some interesting stuff here.

I still don't know that I'd go for lzma, because its compression time sucks assballs. bzip2 is bad enough as it is. lzma is at the back of the pack for compression, every time.

The numbers for Linux sources aren't very encouraging. lzma takes a metric assload of time to compress it. Far, far too long. At least it's quick to decompress the same file.

I guess if we were to mostly move to lzma archives, it wouldn't be so bad on the package installation side. I mean, most of what we do as Gentoo users is unpack tarballs for installation, right? Anything to speed up this process is probably a step in the right direction ... I just wouldn't use lzma for anything else. For Gentoo, it's fine. For personal stuff, it's not.

and

One thing I didn't even think about was post-install compression. Diego mentioned this over at his blog: every merge operation is going to involve some compression for things like docs, manpages, etc.

I'd completely forgotten about the stuff that gets compressed when installed. lzma doesn't scale particularly well to small files, not like gzip does.

I guess it's a tradeoff — you can unpack large archives like kernel sources very quickly, but you'll get (a bit?) more time added to each merge op if lzma is doing the postinstall compression.

I absolutely agree with Josh. The maturity problems aside, being a consumer of lzma archives is nice if you're not short on memory. That said, supporting lzma as a method for upstream tarballs is a nice thing to have. I won't be switching to it for my personal compression needs any time soon.

Bonus fact learned about recent tar versions

Recent versions of tar autodetect the compression for bzip2 and gzip, which means you can use tar xf linux-2.6.25.tar.bz2. Not only that, you can also tell it to guess the compression upon creation of archives: tar caf foo.tbz2 /etc. Nifty. It doesn't work for lzma, though, which might be due to the recent change in file format; file doesn't recognize the files, either.
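In other words (the archive names are just examples):

    # Extraction: tar sniffs the compression, no z/j needed.
    tar xf linux-2.6.25.tar.bz2
    # Creation: the 'a' flag picks the compressor from the suffix.
    tar caf backup.tar.gz /etc
    tar caf backup.tar.bz2 /etc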

Bonus Bonus fact learned: bash completion for tar needs to be fixed.

I've taken the liberty of inserting the link to Diego's blog post.


Posted by klausman | Permanent Link | Categories: Tools of the trade
Comment by mail to blog@ this domain. Why?