I've been looking at my PostgreSQL base backups. They are run using the traditional
tar -c -z -f basebackup.tar.gz $PGDATA/...
way (many details omitted). I haven't gotten heavily into using pg_basebackup yet, but the following could apply there just as well.
I had found some of the base backups to be pretty slow, so I dug a
little deeper. I was surprised to find that the job was completely
CPU bound. The bottleneck was the gzip process, so it was worth thinking about other compression options. (The alternative is of course no compression, but that would waste a lot of space.)
There are two ways to approach this:
- Use a faster compression method.
- Parallelize the compression.
For a faster compression method, there is lzop, for example. GNU tar has support for that, by using --lzop instead of -z. It gives a pretty good speed improvement, but the compression ratio is of course worse.
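For example, the invocation from the beginning might become something like this (just a sketch, with the file name adjusted; check that your GNU tar version supports --lzop):
tar -c --lzop -f basebackup.tar.lzo $PGDATA/...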
For parallelizing compression, there are parallel (multithreaded) implementations of the well-known gzip and bzip2 compression methods, called pigz and pbzip2, respectively. You can hook these into GNU tar by using the -I option, something like -I pigz.
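With the invocation from the beginning, that could look roughly like this (a sketch, not a tested recipe):
tar -c -I pigz -f basebackup.tar.gz $PGDATA/...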
Alternatively, put them into a pipe after tar, so that you can pass them some options, because otherwise they will bring your system to a screeching halt! If you've never seen a system at a constant 1600% CPU for 10 minutes, try these.
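A sketch of the pipe variant, assuming you want to limit pigz to, say, four threads with its -p option:
tar -c -f - $PGDATA/... | pigz -p 4 > basebackup.tar.gz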
If you have a regular service window or natural slow time at night or on weekends, these tools can be quite useful, because you might be able to cut down the time for your base backup from, say, 2 hours to 10 minutes. But if you need to be always on, you will probably want to qualify this a little by reducing the number of CPUs used for this job, for example with the -p option shown above. But it can still be pretty effective if you have many CPUs and want to dedicate a couple to the compression task for a while.
Personally, I have settled on pigz as my standard weapon of choice now. It's much faster than pbzip2 and can easily beat single-threaded lzop. Also, it produces standard gzip output, of course, so you don't need to install special tools everywhere, and you can access the file with standard tools in a bind.
Also, consider all of this in the context of restoring. No matter how you take the backup, wouldn't it be nice to be able to restore a backup almost 8 or 16 or 32 times faster?
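To illustrate, restoring an archive compressed with pigz could look something like this (a sketch; since the output is standard gzip format, plain gunzip piped into tar would work as well, just without the parallelism):
tar -x -I pigz -f basebackup.tar.gz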
I have intentionally not included any benchmark numbers here, because it will obviously be pretty site-specific. But it should be easy to test for everyone, and the results should speak for themselves.