Database development sucks, and I would like to fix it. I don't mean database system software development; that's awesome. ;-) I mean database development as in writing the tables, views, functions, and other code that make up your database. We have come further in recent years in PostgreSQL land. We have had the
PL/pgSQL debugger, there is
Piggly for PL/pgSQL code coverage, we have
pgTAP for unit testing support, we have in
Post Facto a version control system running inside the database, although I'll try to explain below why I think that that is not the right solution. My problem is that getting database code from the editor to the database server in a safe manner is pretty difficult. This already starts with deploying simple database code to a single server for the first time (as shown in this
entry), and gradually gets more complicated when you need to update existing installations, manage multiple servers, or even multiple versions of that code.
My answer to that problem is an old friend: package management. Package managers such as dpkg and rpm are pretty well-established solutions and have shown over the years that managing software deployments can be easy and occasionally even fun.
Consider how the development and deployment of normal software proceeds:
- You edit some files in your text editor.
- You execute some kind of build step (could be compilation or more trivial).
- Maybe you run a test suite.
- You check your files into version control.
And then you or someone else, perhaps a release manager or system administratior does something like this:
- Tag a release.
- Export the tag from version control.
- Build the software.
- Run tests.
- Build a package.
- Push package to repository.
- Install/upgrade package on deployment machines.
There are of course sites that do this differently. One could, for example, just copy the built files onto the target machines. But if you do that, then you might as well be happy with the current state of SQL deployment. Many sites are going for
continuous deployment nowadays, in which case you might not need a way to track what version of the code is currently running, but you will still want a way to make deployments and upgrades in a safe manner.
My premise is that the development cycle of database software should work in pretty much the same way as for other software. But we are only slowly getting the tools to support this work flow.
We have editors that allow editing SQL, of course. We don't really need any new build tools; the existing ones appear to serve just fine. We have testing support, getting better all the time. At this point it was once thought that perhaps a database-based version control system such as Post Facto would address the remaining issues. After considering that for a while I have figured out that it doesn't. The main problems with that approach are:
- Abuses version control system as a deployment tool. Some people deploy code by svn checkout, but most don't.
- Does not allow for a build step (in a straightforward manner). The database is the source code. You can't generate your database from some XML source, for example.
- I consider it preferable to put database code and related non-database code in the same version control system, so they can be developed, browsed, deployed, etc. together.
A database-based version control system might actually be the thing for some people, just like other version control systems serve different needs. (Except, note that mine is better than yours! ;-) )
So I think what the world needs is a package manager for databases, that manages database objects just like a package manager for operating systems manages files. And it should work in much the same ways, like this:
- Conceptually, there are two major modes of operation: building a package and installing a package (think rpmbuild and rpm, or dpkg-buildpackage and dpkg).
- Building a package is controlled by a special instruction file (think RPM spec file or debian/ directory).
- Building a package involves running that software's normal build routine and installing it into a temporary staging area. In simple cases, this is running psql -f somefile.sql, but it could be anything. You could have an ORM tool generate the database.
- The package file is an archive file containing the contents of that staging area plus some metadata.
- Installing a package unpacks the archive file and places the contents into the designated destination, while sensibly dealing with existing installations.
There are additional features of well-known package managers that could be of interest in a database environment: pre/post scripts for additional setup, relocations (perhaps for selecting installation schemas), integration with a repository manager (like yum or apt).
So I have set out to write such a tool and have arrived at a small prototype. Here is how it works:
$ pgpkg build
pgpgk/rules clean
make: Nothing to be done for `clean'.
pgpgk/rules build
make: Nothing to be done for `build'.
pgpgk/rules binary
psql -X -q -f install_foo.sql
pgpkg create
wrote ../foo_1.0.pgpkg
This build process is currently driven by a Debian-style rules and
control file. An RPM-style spec file might also be possible.
$ pgpkg contents ../foo_1.0.pgpkg
{'Client-Encoding': 'UTF8', 'Version': 1.0, 'Format': '0.0', 'Package': 'foo'}
CREATE FUNCTION "public"."test1"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 1$$;
CREATE FUNCTION "public"."test2"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 2$$;
CREATE FUNCTION "public"."test3"(integer, integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 3$$;
$ pgpkg install -d 'dbname=bar' ../foo_1.0.pgpkg
$ pgpkg remove -d 'dbname=bar' foo
There are more features planned, such as diffing packages, but I haven't gotten to those yet. Also, don't get too excited yet, it only handles a very small subset of SQL objects. Much more code needs to be added to handle everything, but that will mostly be a typing exercise (famous last words).
Ah, but what about this extension packaging business that has been discussed for the last couple of years? Well, I have been
staring at the various design documents for that same couple of years now, and it's not going anywhere and it's not really set out to solve my problem. Partially by coincidence, partially by intent, as I'm posting this the guys are sitting in Ottawa discussing extension packaging yet again, so at least take this as a challenge to get something going. :-)
Note that I am calling this thing a "package", because that's what it is. It's not (necessarily) an "extension" (although an extension could be packaged as a package), and it's not a "module" (that term is reserved by SQL; what Oracle calls a package is actually a module).
Also note that this is implemented as an external program, not in the database server. I think the operating system shell is a much more usable and versatile environment. Think of all the things you might want to do: build many packages, compare packages, install
some/many/all packages, grep package contents, copy, download, or upload packages. This makes most sense from the shell. If you think about it, neither dpkg nor rpm require extensions in the kernel or the file system to do their work; the same applies here. Another advantage is that the implementation can evolve faster and doesn't need to wait for say PostgreSQL 9.1 to be out.
Anyway, for the moment, forget about extensions. Think about the database application code that you need to move around. Does the above approach make sense? Could it be useful? Ideas, suggestions?