Wednesday, May 19, 2010

PostgreSQL Package Management

Database development sucks, and I would like to fix it. I don't mean database system software development; that's awesome. ;-) I mean database development as in writing the tables, views, functions, and other code that make up your database. We have come a long way in recent years in PostgreSQL land: we have the PL/pgSQL debugger, there is Piggly for PL/pgSQL code coverage, we have pgTAP for unit testing support, and in Post Facto we have a version control system running inside the database, although I'll try to explain below why I think that is not the right solution. My problem is that getting database code from the editor to the database server in a safe manner is pretty difficult. It already starts with deploying simple database code to a single server for the first time (as shown in this entry), and it gradually gets more complicated when you need to update existing installations, manage multiple servers, or even multiple versions of that code.
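For reference, that naive first-time deployment is often little more than this (a minimal sketch; the file and database names are made up):

$ psql -X -1 -f myapp.sql -d mydb    # -X skips psqlrc, -1 wraps the script in a single transaction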

My answer to that problem is an old friend: package management. Package managers such as dpkg and rpm are pretty well-established solutions and have shown over the years that managing software deployments can be easy and occasionally even fun.

Consider how the development and deployment of normal software proceed:
  • You edit some files in your text editor.
  • You execute some kind of build step (could be compilation or more trivial).
  • Maybe you run a test suite.
  • You check your files into version control.
And then you or someone else, perhaps a release manager or system administrator, does something like this (a rough shell sketch follows the list):
  • Tag a release.
  • Export the tag from version control.
  • Build the software.
  • Run tests.
  • Build a package.
  • Push package to repository.
  • Install/upgrade package on deployment machines.
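A minimal sketch of those steps, assuming git, make, and the Debian packaging toolchain; all file and host names here are made up:

$ git tag v1.0                                     # tag a release
$ git archive --prefix=myapp-1.0/ v1.0 | tar -x    # export the tag from version control
$ (cd myapp-1.0 && make && make check)             # build the software and run tests
$ (cd myapp-1.0 && dpkg-buildpackage -us -uc)      # build a package
$ dput myrepo myapp_1.0_amd64.changes              # push the package to a repository
$ ssh host1 sudo apt-get install myapp             # install/upgrade on a deployment machine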
There are of course sites that do this differently. One could, for example, just copy the built files onto the target machines. But if you do that, then you might as well be happy with the current state of SQL deployment. Many sites are going for continuous deployment nowadays, in which case you might not need a way to track what version of the code is currently running, but you will still want a way to make deployments and upgrades in a safe manner.

My premise is that the development cycle of database software should work in pretty much the same way as for other software. But we are only slowly getting the tools to support this work flow. We have editors that allow editing SQL, of course. We don't really need any new build tools; the existing ones appear to serve just fine. We have testing support, getting better all the time.

At this point it was once thought that perhaps a database-based version control system such as Post Facto would address the remaining issues. After considering that for a while I have figured out that it doesn't. The main problems with that approach are:
  • Abuses the version control system as a deployment tool. Some people deploy code by svn checkout, but most don't.
  • Does not allow for a build step (in a straightforward manner). The database is the source code. You can't generate your database from some XML source, for example.
  • I consider it preferable to put database code and related non-database code in the same version control system, so they can be developed, browsed, deployed, etc. together.
A database-based version control system might actually be the thing for some people, just like other version control systems serve different needs. (Except, note that mine is better than yours! ;-) )

So I think what the world needs is a package manager for databases that manages database objects just like a package manager for operating systems manages files. And it should work in much the same way, like this:

  • Conceptually, there are two major modes of operation: building a package and installing a package (think rpmbuild and rpm, or dpkg-buildpackage and dpkg).
  • Building a package is controlled by a special instruction file (think RPM spec file or debian/ directory).
  • Building a package involves running that software's normal build routine and installing it into a temporary staging area. In simple cases, this is running psql -f somefile.sql, but it could be anything. You could have an ORM tool generate the database.
  • The package file is an archive file containing the contents of that staging area plus some metadata.
  • Installing a package unpacks the archive file and places the contents into the designated destination, while sensibly dealing with existing installations.
There are additional features of well-known package managers that could be of interest in a database environment: pre/post scripts for additional setup, relocations (perhaps for selecting installation schemas), integration with a repository manager (like yum or apt).

So I have set out to write such a tool and have arrived at a small prototype. Here is how it works:

$ pgpkg build
 pgpkg/rules clean
make: Nothing to be done for `clean'.
 pgpkg/rules build
make: Nothing to be done for `build'.
 pgpkg/rules binary
psql -X -q -f install_foo.sql
pgpkg create
wrote ../foo_1.0.pgpkg
This build process is currently driven by a Debian-style rules and control file. An RPM-style spec file might also be possible.
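For illustration, the pgpkg/control and pgpkg/rules files behind the transcript above might look roughly like this (the field names and targets are guesses from the output, purely hypothetical):

pgpkg/control:
Package: foo
Version: 1.0

pgpkg/rules (a makefile, as in Debian's debian/rules):
#!/usr/bin/make -f
clean:
build:
binary:
	psql -X -q -f install_foo.sql
	pgpkg create
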
$ pgpkg contents ../foo_1.0.pgpkg
{'Client-Encoding': 'UTF8', 'Version': 1.0, 'Format': '0.0', 'Package': 'foo'}
CREATE FUNCTION "public"."test1"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 1$$;
CREATE FUNCTION "public"."test2"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 2$$;
CREATE FUNCTION "public"."test3"(integer, integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 3$$;

$ pgpkg install -d 'dbname=bar' ../foo_1.0.pgpkg

$ pgpkg remove -d 'dbname=bar' foo
There are more features planned, such as diffing packages, but I haven't gotten to those yet. Also, don't get too excited yet: it only handles a very small subset of SQL objects. Much more code needs to be added to handle everything, but that will mostly be a typing exercise (famous last words).

Ah, but what about this extension packaging business that has been discussed for the last couple of years? Well, I have been staring at the various design documents for that same couple of years now; it's not going anywhere, and it doesn't really set out to solve my problem. Partially by coincidence, partially by intent, as I'm posting this the guys are sitting in Ottawa discussing extension packaging yet again, so at least take this as a challenge to get something going. :-)

Note that I am calling this thing a "package", because that's what it is. It's not (necessarily) an "extension" (although an extension could be packaged as a package), and it's not a "module" (that term is reserved by SQL; what Oracle calls a package is actually a module).

Also note that this is implemented as an external program, not in the database server. I think the operating system shell is a much more usable and versatile environment. Think of all the things you might want to do: build many packages, compare packages, install some/many/all packages, grep package contents, copy, download, or upload packages. This makes most sense from the shell. If you think about it, neither dpkg nor rpm requires extensions in the kernel or the file system to do their work; the same applies here. Another advantage is that the implementation can evolve faster and doesn't need to wait for, say, PostgreSQL 9.1 to be out.
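For instance, using only the commands shown above, ad-hoc operations compose naturally in the shell (the database names are made up):

$ pgpkg contents ../foo_1.0.pgpkg | grep test2
$ for db in bar baz; do pgpkg install -d "dbname=$db" ../foo_1.0.pgpkg; done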

Anyway, for the moment, forget about extensions. Think about the database application code that you need to move around. Does the above approach make sense? Could it be useful? Ideas, suggestions?

5 comments:

  1. Can you explain further what you mean by "the database is the source code"? If I look at a full-featured DBMS like PG, the "database" seems to me to be more like a combination of two things:

    First, there is the machine-usable schema which resides in the pg_* tables, which roughly corresponds to "built software"; a text file of DDL is really the "source code" in that sense, and it is transient--just a scaffolding to "build" the pg_* tables. Another way to look at it might be that the pg_* tables are "configuration", not "built software", but the line between those two is grey (config can often be seen as merely declarative software). Personally, I'm so used to languages (Python) that allow me to omit a build step, that I'd be interested in imagining ways to omit this separation in DB's too. Ah well, spec compliance would probably forbid that.

    Second, there is live data, which IMO corresponds to "runtime state" of a typical software application, something you didn't really mention. It just happens to be a long and sporadic runtime ;). But unlike most running software, we can't "stop the process" and abandon all our runtime state every time we upgrade. Maybe we should start looking at code systems that allow you to do that (like Smalltalk) and see how they handle versioning.

  2. Hi Peter,

    I think your post is mostly spot on, except for a little detail: you're not solving the pg_restore problem that I have with upgrading PostgreSQL and the extensions in there.

    Apart from that, I think your ideas would complement my work just fine; I'm not working on the packaging of extensions but on their existence in the backend. Approach David Wheeler about this subject; he wants to work on a CPAN-like infrastructure atop extensions.

    CPAN would be like source-based distribution; what you seem to want is Debian-style packaging, and I think it should come atop of that.

    We're just not solving the same problems. Go ahead! :)

  3. We kind of do what you are talking about by using schemas for running our postgres functions.

    Essentially an upgrade checkout/load/test system can run by loading the test schema and then running a set of test queries in transactions against the underlying tables.

    This approach can be made more sophisticated by replaying a plpgsql log back against a test schema + base tables from a specific point in time.

    One could fairly easily promote such a test schema to a production schema after dropping the existing production schema through a programme. We do it by hand at present.
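    For illustration, that by-hand promotion is roughly the following (the schema names are hypothetical):

    BEGIN;
    DROP SCHEMA production CASCADE;
    ALTER SCHEMA test RENAME TO production;
    COMMIT;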

    Updates to base tables (which we keep in a schema called '_data') are done by hand, but since it is typically adding columns this isn't a problem.

  4. @RoryCL

    That sounds like package management by installing into separate directories. That is of course widely practiced and works well to some degree, but it's not really a proper solution.
