My answer to that problem is an old friend: package management. Package managers such as dpkg and rpm are pretty well-established solutions and have shown over the years that managing software deployments can be easy and occasionally even fun.
Consider how the development and deployment of normal software proceeds:
- You edit some files in your text editor.
- You execute some kind of build step (could be compilation or more trivial).
- Maybe you run a test suite.
- You check your files into version control.
- Tag a release.
- Export the tag from version control.
- Build the software.
- Run tests.
- Build a package.
- Push package to repository.
- Install/upgrade package on deployment machines.
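The release half of that list can be sketched in a few lines of shell. This is a toy walk-through only: the project name, version, and file names are invented, and a real setup would export the tagged tree from version control rather than copying a directory.

```shell
#!/bin/sh -e
# Toy release pipeline: export, build a package, inspect it.
mkdir -p myproj
echo 'SELECT 1;' > myproj/install.sql

version=1.0
rm -rf "myproj-$version"
cp -r myproj "myproj-$version"        # stands in for "export the tag"

# "Build a package": archive the exported tree.
tar -czf "myproj_$version.tar.gz" "myproj-$version"

# "Push" would copy the archive to a repository; "install" would
# unpack it on the deployment machines.
tar -tzf "myproj_$version.tar.gz"
```

The point is that every step after "tag a release" operates on an immutable artifact, which is exactly what the database workflow below lacks.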
Now compare how database code is usually handled: you keep SQL scripts in version control and run them directly against the deployment database. This has several problems:
- It abuses the version control system as a deployment tool. Some people deploy code by svn checkout, but most don't.
- Does not allow for a build step (in a straightforward manner). The database is the source code. You can't generate your database from some XML source, for example.
- I consider it preferable to put database code and related non-database code in the same version control system, so they can be developed, browsed, deployed, etc. together.
So I think what the world needs is a package manager for databases that manages database objects just like an operating-system package manager manages files. And it should work in much the same way:
- Conceptually, there are two major modes of operation: building a package and installing a package (think rpmbuild and rpm, or dpkg-buildpackage and dpkg).
- Building a package is controlled by a special instruction file (think RPM spec file or debian/ directory).
- Building a package involves running that software's normal build routine and installing it into a temporary staging area. In simple cases, this is running psql -f somefile.sql, but it could be anything. You could have an ORM tool generate the database.
- The package file is an archive file containing the contents of that staging area plus some metadata.
- Installing a package unpacks the archive file and places the contents into the designated destination, while sensibly dealing with existing installations.
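The build/install split above can be illustrated with plain shell tools. This is only a toy: the file names and metadata fields are invented and are not the prototype's real format.

```shell
#!/bin/sh -e
# Toy illustration: a package is an archive of a staging area plus
# some metadata; installing means unpacking it into a destination.
rm -rf stage dest
mkdir -p stage
cat > stage/objects.sql <<'EOF'
CREATE FUNCTION test1(integer) RETURNS integer LANGUAGE sql AS $$select 1$$;
EOF
printf 'Package: foo\nVersion: 1.0\n' > stage/METADATA

# "Building" rolls the staging area into one archive file.
tar -czf foo_1.0.pkg -C stage .

# "Installing" unpacks it into the destination; a real tool would also
# consult the metadata and merge sensibly with what is already there.
mkdir -p dest
tar -xzf foo_1.0.pkg -C dest
ls dest
```

The hard part a real tool adds on top of this is the last clause: dealing sensibly with existing installations.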
So I have set out to write such a tool and have arrived at a small prototype. Here is how it works:
```
$ pgpkg build
pgpgk/rules clean
make: Nothing to be done for `clean'.
pgpgk/rules build
make: Nothing to be done for `build'.
pgpgk/rules binary
psql -X -q -f install_foo.sql
pgpkg create
wrote ../foo_1.0.pgpkg
```

This build process is currently driven by a Debian-style rules and control file. An RPM-style spec file might also be possible.
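The clean, build, and binary invocations in the transcript suggest a debian/rules-style makefile. For illustration, a minimal such file might look like the following; the target names come from the output above, but the contents are my guess, not the prototype's actual format:

```make
#!/usr/bin/make -f
# Hypothetical pgpkg rules file for a plain-SQL package.

clean:
	# nothing to clean for a plain SQL package

build:
	# no build step needed; the SQL file is used as-is

binary:
	psql -X -q -f install_foo.sql
```

In more complex cases the build and binary targets could compile code or have an ORM tool generate the schema, as described above.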
```
$ pgpkg contents ../foo_1.0.pgpkg
{'Client-Encoding': 'UTF8', 'Version': 1.0, 'Format': '0.0', 'Package': 'foo'}
CREATE FUNCTION "public"."test1"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 1$$;
CREATE FUNCTION "public"."test2"(integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 2$$;
CREATE FUNCTION "public"."test3"(integer, integer) RETURNS integer LANGUAGE "sql" VOLATILE AS $$select 3$$;

$ pgpkg install -d 'dbname=bar' ../foo_1.0.pgpkg
$ pgpkg remove -d 'dbname=bar' foo
```

There are more features planned, such as diffing packages, but I haven't gotten to those yet. Also, don't get too excited yet: it only handles a very small subset of SQL objects. Much more code needs to be added to handle everything, but that will mostly be a typing exercise (famous last words).
Ah, but what about this extension packaging business that has been discussed for the last couple of years? Well, I have been staring at the various design documents for that same couple of years now, and it's not going anywhere, and it isn't really set out to solve my problem. Partially by coincidence, partially by intent, as I'm posting this the guys are sitting in Ottawa discussing extension packaging yet again, so at least take this as a challenge to get something going. :-)
Note that I am calling this thing a "package", because that's what it is. It's not (necessarily) an "extension" (although an extension could be packaged as a package), and it's not a "module" (that term is reserved by SQL; what Oracle calls a package is actually a module).
Also note that this is implemented as an external program, not in the database server. I think the operating system shell is a much more usable and versatile environment. Think of all the things you might want to do: build many packages, compare packages, install some/many/all packages, grep package contents, copy, download, or upload packages. This makes most sense from the shell. If you think about it, neither dpkg nor rpm requires extensions in the kernel or the file system to do its work; the same applies here. Another advantage is that the implementation can evolve faster and doesn't need to wait for, say, PostgreSQL 9.1 to be out.
Anyway, for the moment, forget about extensions. Think about the database application code that you need to move around. Does the above approach make sense? Could it be useful? Ideas, suggestions?
Can you explain further what you mean by "the database is the source code"? If I look at a full-featured DBMS like PG, the "database" seems to me to be more like a combination of two things:
First, there is the machine-usable schema which resides in the pg_* tables, which roughly corresponds to "built software"; a text file of DDL is really the "source code" in that sense, and it is transient--just a scaffolding to "build" the pg_* tables. Another way to look at it might be that the pg_* tables are "configuration", not "built software", but the line between those two is grey (config can often be seen as merely declarative software). Personally, I'm so used to languages (Python) that allow me to omit a build step, that I'd be interested in imagining ways to omit this separation in DB's too. Ah well, spec compliance would probably forbid that.
Second, there is live data, which IMO corresponds to "runtime state" of a typical software application, something you didn't really mention. It just happens to be a long and sporadic runtime ;). But unlike most running software, we can't "stop the process" and abandon all our runtime state every time we upgrade. Maybe we should start looking at code systems that allow you to do that (like Smalltalk) and see how they handle versioning.
Hi Peter,
I think your blog post is mostly spot on, except for a little detail: you're not solving the pg_restore problem that I have with upgrading PostgreSQL and the extensions in there.
Apart from that, I think that your ideas would complement my work just fine; I'm not working on the packaging of extensions but on their existence in the backend. Approach David Wheeler about this subject: he wants to work on a CPAN-like infrastructure atop extensions.
CPAN would be like source-based distribution; what you seem to want is Debian-style packaging, and I think it should come atop of that.
We're just not solving the same problems. Go ahead! :)
We kind of do what you are talking about by using schemas for running our postgres functions.
Essentially an upgrade checkout/load/test system can run by loading the test schema and then running a set of test queries in transactions against the underlying tables.
This approach can be made more sophisticated by replaying a plpgsql log back against a test schema + base tables from a specific point in time.
One could fairly easily promote such a test schema to a production schema after dropping the existing production schema through a programme. We do it by hand at present.
Updates to base tables (which we keep in a schema called '_data') are done by hand, but since it is typically adding columns this isn't a problem.
@RoryCL
That sounds like package management by installing into separate directories. That is of course widely practiced and works well to some degree, but it's not really a proper solution.