Thursday, October 29, 2009

A History of Tarballs

I have been maintaining the autoconfigury of PostgreSQL for many years now, and every once in a while I go to ftp://ftp.gnu.org/gnu/autoconf/ to check out a new version of Autoconf. That FTP listing is actually an interesting tale of how tarball creation practices have evolved over the years.

Obviously, .tar.gz has been the standard all along. Some projects have now completely abandoned .tar.gz in favor of .tar.bz2, but those are rare. I think most ship both now. The FTP listing goes back to 1996; the first .tar.bz2 was shipped in 2001.

RPM-based distributions switched to supporting and then requiring bzip2-compressed tarballs many years ago. Debian might start supporting that with the next release. So if you want to be able to trace your pristine tarballs throughout the popular Linux distributions, shipping both is best.

One thing that was really popular back then but is almost forgotten now is providing patches between versions, like autoconf-2.12-2.13.diff.gz. The Linux kernel still does that. Autoconf stopped doing that in 1999, when the diffs were replaced by xdelta files. Anyone remember those? The xdeltas lasted until 2002 and were briefly revived in 2008. I think shipping xdeltas is also obsolete now except possibly for huge projects.

In 2003, they started signing releases. First with ASCII-armored signatures (.asc), now with binary signatures (.sig). The Linux kernel also does this, except they call the ASCII-armored signatures .sign.

In 2008, we saw the latest invention, LZMA-compressed tarballs (.tar.lzma). They appear to compress better than bzip2 by about as much as bzip2 wins over gzip. But this one's already obsolete because it was replaced in 2009 by the xz format, which uses LZMA2 and goes by the file extension .tar.xz. Some "early adopters" such as Debian's packaging tool dpkg are in the process of adding xz support in addition to the short-lived lzma support.

Throughout all this, interestingly, tar hasn't changed a bit. Well, there are various incompatible extended tar formats around, but when this becomes a problem, people tend to revert to GNU tar.

GNU tar, by the way, supports all the above compression formats internally. gzip is -z, bzip2 is -j, lzma is, well, --lzma, and xz is -J. And Automake supports creating all these different formats for source code distributions.
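For anyone who wants to compare the formats on their own source tree, here is a minimal sketch using Python's standard tarfile module rather than GNU tar itself. It assumes a Python new enough to have xz support (3.3 or later); the function name make_tarballs and the demo-1.0 directory are made up for illustration, and the legacy .tar.lzma format is left out.

```python
import os
import tarfile
import tempfile

# tarfile write modes, with the roughly equivalent GNU tar flag noted.
MODES = {
    ".tar.gz": "w:gz",    # tar -z
    ".tar.bz2": "w:bz2",  # tar -j
    ".tar.xz": "w:xz",    # tar -J
}


def make_tarballs(src_dir, out_dir, base_name):
    """Write one tarball per format; return {path: size_in_bytes}."""
    sizes = {}
    for ext, mode in MODES.items():
        path = os.path.join(out_dir, base_name + ext)
        with tarfile.open(path, mode) as tar:
            # arcname gives the conventional top-level "name-version/" directory.
            tar.add(src_dir, arcname=base_name)
        sizes[path] = os.path.getsize(path)
    return sizes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical tiny source tree, just so there is something to pack.
        src = os.path.join(tmp, "src")
        os.mkdir(src)
        with open(os.path.join(src, "README"), "w") as f:
            f.write("hello tarball\n" * 1000)
        for path, size in make_tarballs(src, tmp, "demo-1.0").items():
            print(os.path.basename(path), size, "bytes")
```

The sizes you get depend heavily on the input, so treat any single run as an anecdote rather than a benchmark.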

6 comments:

  1. Err.. RPMs don't *require* bzip2.

  2. bzip2 is irrelevant, gzip is a golden classic. Why? Gzip is a really good compromise -- fast with reasonable compression.

    Bzip2 is just one of the many compressors that try to compress as well as possible. These are obsoleted every few years by the next "large-effort" compressor, and your LZMA example shows that very well.

  3. RPM doesn't require it, but some RPM-based distributions require it per their policy.

  4. Thanks for the heads-up on LZMA. I read some of the benchmarks and statistics about it, and it appears to be an interesting algorithm. I've written up some hypotheses about its energy consumption properties...

    http://blog.maz.nu/post/227826865/eco-friendly-compression

    It'd be interesting to see if LZMA is actually a more efficient algorithm than bzip2 when looking at energy costs. I might have to do a benchmark looking at exactly this!

  5. It's interesting that the archive format hasn't changed much. It's a shame, as the lack of internal compression (forcing you to decompress the entire archive to peek at it) combined with the lack of a TOC (forcing you to seek through the entire archive to derive a list of contents) are serious limitations. I'm tempted to use ZIP only for future releases of my software.

  6. Yeah, zip is actually not bad. In the Linux world, zip tends to reek of Windows, but it actually has its merits.
