Sunday, May 20, 2012

Base backup compression options

I've been looking at my PostgreSQL base backups. They are run using the traditional

tar -c -z -f basebackup.tar.gz $PGDATA/...

way (many details omitted). I haven't gotten heavily into using pg_basebackup yet, but the following could apply there just as well.

I had found some of the base backups to be pretty slow, so I dug a little deeper. I was surprised to find that the job was completely CPU bound. The blocking factor was the gzip process. So it was worth thinking about other compression options. (The alternative is of course no compression, but that would waste a lot of space.)

There are two ways to approach this:

  • Use a faster compression method.

  • Parallelize the compression.

For a faster compression method, there is lzop, for example. GNU tar has support for that, by using --lzop instead of -z. It gives a pretty good speed improvement, but the compression results are of course worse.
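
For example, adapting the invocation from the top of the post (the archive name here is just a placeholder), the lzop variant looks something like this:

tar -c --lzop -f basebackup.tar.lzo $PGDATA/...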

For parallelizing compression, there are parallel (multithreaded) implementations of the well-known gzip and bzip2 compression methods, called pigz and pbzip2, respectively. You can hook these into GNU tar by using the -I option, something like -I pigz. Alternatively, put them into a pipe after tar, so that you can pass them some options, because otherwise they will bring your system to a screeching halt! If you've never seen a system at a constant 1600% CPU for 10 minutes, try these.
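
To illustrate, here are both forms as rough sketches (the archive name is again just a placeholder; pbzip2 works analogously with a .tar.bz2 output):

tar -c -I pigz -f basebackup.tar.gz $PGDATA/...
tar -c -f - $PGDATA/... | pigz > basebackup.tar.gz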

If you have a regular service window or a natural slow time at night or on weekends, these tools can be quite useful, because you might be able to cut down the time for your base backup from, say, 2 hours to 10 minutes. But if you need to be always on, you will probably want to qualify this a little by reducing the number of CPUs used for the job. Even then, it can still be pretty effective if you have many CPUs and want to dedicate a couple to the compression task for a while.
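
As a sketch of that, pigz takes a -p option to cap the number of processes it uses (pbzip2 has an analogous -p option); the count here is only an example to tune to your system:

tar -c -f - $PGDATA/... | pigz -p 4 > basebackup.tar.gz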

Personally, I have settled on pigz as my standard weapon of choice now. It's much faster than pbzip2 and can easily beat single-threaded lzop. Also, it produces standard gzip output, of course, so you don't need to install special tools everywhere, and you can access the file with standard tools in a bind.
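
For example, since the output is ordinary gzip, an archive created with pigz can be unpacked in a pinch with nothing but plain GNU tar and gzip:

tar -x -z -f basebackup.tar.gz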

Also, consider all of this in the context of restoring. No matter how you take the backup, wouldn't it be nice to be able to restore a backup almost 8 or 16 or 32 times faster?

I have intentionally not included any benchmark numbers here, because it will obviously be pretty site-specific. But it should be easy to test for everyone, and the results should speak for themselves.

3 comments:

  1. try also
    `pxz -3 --threads=$WANTEDCPU`
    it is comparable to gzip in speed and sometimes compresses better

  2. pigz is single threaded on decompression, unfortunately. lzop decompression is about twice as fast as gzip here. pbzip2 decompresses in parallel, but the speed isn't very impressive even on low compression levels.

    For very impressive compression/decompression speeds with moderate compression similar to lzop, there's quicklz (http://www.quicklz.com/); Percona uses this for their Xtrabackup MySQL backup utility. It's even less standard than lzop and the tools are very primitive, but it's quite fast. :)

  3. Have a look at http://itwik.wordpress.com/2010/04/30/speeding-up-postgres-onlinebackup-compression/; I guess the basic principle would apply to pg_dump(all) as well.
