Tuesday, May 15, 2012

My (anti-)take on database schema version management

There were a number of posts recently about managing schema versions and schema deployment in PostgreSQL. I have analyzed these with great interest, but I have concluded that they are all more or less significantly flawed. (Of course, most of these solutions do in fact work for someone, but they are not general enough to become canonical go-to solutions for this problem class, in my opinion.) I have developed a list of elimination criteria by which I can evaluate future developments in this area. So here are some of the things that I don't want in my schema version management system:

  • Using schemas for distinguishing multiple versions (like this, but that's actually more about API versioning). That simply won't work for deploying objects that are not in schemas, such as casts, languages, extensions, and, well, schemas.

  • Using extensions (like this). Well, this could work. But extensions by themselves do nothing about the core problem. They are just an SQL wrapper interface around upgrade scripts: you still need to write the upgrade scripts, order them, test them, and package them. The extension mechanism might replace, say, the shell script that would otherwise run the upgrade files in a suitable order. (A minimal sketch of what that looks like follows after this list.) Another issue is that extensions require access to the server file system. Changing this is being discussed as "inline extensions", but there is no consensus. This is a smaller problem, but it needs to be thought about. Also, I do need to support PostgreSQL 9.0 and earlier for a little while longer.

  • Requiring naming each change (patch names, like this). Naming things is hard. Numbering things is easy. And how many good names are you still going to be able to come up with after 100 or so changes?

    Take a lesson from file version control systems: versions are numbers or, if it must be, hashes or the like (UUIDs have been suggested).

  • Using a version control tool for tracking upgrade paths (like this). Sqitch, unlike the initial draft of this concept, doesn't actually require a version control tool for deployment (that wouldn't have worked for me, because what we ship is a tarball or a deb/rpm-type package). But it still requires you to maintain some kind of sanity in your version control history so that the tool can make sense of it. That sounds fragile and inconvenient. The other choice appears to be writing the plan files manually without any VCS involvement, but then much of the apparent appeal of this tool is lost, and it's really no better than the "naming each change" approach mentioned above.

  • Taking snapshots or the like of a production or staging or central development system. Production systems and staging systems should be off limits for this sort of thing. Central development systems don't exist, because with distributed version control, every developer has their own setups, branches, deployments, and world views.

    You could set it up so that every developer gets their own test database, sets up the schema there, takes a dump, and checks that in. There are going to be problems with that, including that dumps produced by pg_dump are ugly and optimized for restoring, not for developing with, and they don't have a deterministic output order.

  • Storing the database source code in a different place or in a different manner than the rest of the source code. This includes using a version control system as mentioned above (meaning storing part of the information in the version control metadata rather than in the files that are checked into version control in the normal way), using a separate repository like Post Facto, or using something like the staging server mentioned earlier.

    The source is the source, and it must be possible to check out, edit, build, test, and deploy everything in a uniform and customary manner.

  • Allowing lattice-like dependencies between change sets (like most examples cited above). This sounds great on paper, especially if you want to support distributed development in branches. But then you can have conflicts, for example where two upgrades add a column to the same table: depending on the upgrade path, you end up with different results (see the concrete example after this list). As your change graph grows, you will have an exploding number of possible upgrade paths that need to be tested.

    There needs to be an unambiguous, canonical state of the database schema for a given source checkout.

  • Requiring running through all the upgrade scripts for a fresh deployment (like this). There are two problems with this. First, it's probably going to be very slow. Fresh deployments need to be reasonably fast, because they will be needed for automated tests, including unit tests, where you don't want to wait for minutes to set up the basic schema. Second, it's inefficient. Over time, you might drop columns, add new columns, delete rows, migrate them to different tables, etc. If you run through all those upgrade scripts, then a supposedly fresh database will already contain a bunch of rubble, dropped attributes, dead rows, and the like.

    Therefore, the current version needs to be deployable from a script that will not end up replaying history.

  • Using metalanguages or abstraction layers (like Pyrseas or Liquibase or any of the metaformats included in various web frameworks). It'd probably be a good idea to check some of these out for simple applications. But my concern is whether using an abstraction layer would prevent me from using certain features. For example, look at the Pyrseas feature matrix. It's pretty impressive. But it doesn't support extensions, roles, or grants. So (going by that list), I can't use it. (It's being worked on.) And when I looked at an earlier version for a previous project, it didn't support foreign-data functionality, so I couldn't use it then either. And those are just the top-level things the author thought of. Actually, the Pyrseas author has gone to some effort to have almost complete coverage of PostgreSQL DDL features, so give this tool a try. But it won't be for everyone.
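
As promised above, here is a minimal sketch of what the extension-based approach amounts to. All names (the extension myapp, the versions 1.0 and 1.1, the database mydb) are made up for illustration; the point is that you still write, order, and package the upgrade scripts yourself, and the files have to live on the server's file system:

    # files under SHAREDIR/extension/ on the database server
    myapp.control          # contains, among other things: default_version = '1.1'
    myapp--1.0.sql         # creates the objects from scratch for version 1.0
    myapp--1.0--1.1.sql    # the upgrade script you still have to write yourself

    # installing and upgrading is then just
    psql -d mydb -c "CREATE EXTENSION myapp VERSION '1.0'"
    psql -d mydb -c "ALTER EXTENSION myapp UPDATE TO '1.1'"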

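And to make the problem with lattice-like dependencies concrete, here is the kind of conflict I mean, using a hypothetical customers table and two change sets developed in separate branches:

    # change set A, developed in one branch
    psql -d mydb -c "ALTER TABLE customers ADD COLUMN phone text"
    # change set B, developed independently in another branch
    psql -d mydb -c "ALTER TABLE customers ADD COLUMN email text"

Both are harmless on their own, but a database that applies A before B ends up with a different column order than one that applies B before A. Two installations that are nominally at the same version then no longer produce identical dumps, and anything that depends on column order (COPY or INSERT without a column list) behaves differently. And the number of possible orderings to test grows with every independent change set.
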
So, I'm going to keep looking.

9 comments:

  1. Thanks for the summary (I never imagined there were so many variations on this theme) and, as the Pyrseas author, thank you for the vote of confidence. I've been distracted by the Pyrseas "Extender" subproject, otherwise support for PostgreSQL EXTENSIONs would've been done by now. If I may ask, I see that GRANTs (and object owners) are important also, but how crucial is it for you to create/modify ROLEs, which as you know span databases?

  2. This happens to be my favorite database subject; I've spent 10 years trying to get this right. I feel I finally solved the problem two years ago. And it's very simple.

    I totally agree with:
    "dumps produced by pg_dump are ugly and optimized for restoring, not for developing with, and they don't have a deterministic output order."

    That is why I added the --split option to pg_dump, which lets you make a plain text schema dump, where each table/sequence/view/function/trigger/etc. is written into its own file, e.g.:
    ./dumpfile-split/public/FUNCTION/new_user.sql
    ./dumpfile-split/public/TABLE/users.sql
    ./dumpfile-split/someschema/TABLE/accounts.sql

    Revision control software systems, like git, work on files in directories, which is a perfect fit for this solution.

    I have a cronjob for each database (currently production, test, and development) which every minute automatically commits any changes to the schema. This is what it looks like:

    $ cat /home/postgres/git_push_schema.sh
    #!/bin/sh
    echo Deleting old
    # remove the previous dump file and split directory before regenerating them
    rm /home/postgres/database/gluepay
    rm -rf /home/postgres/database/gluepay-split
    echo Dumping
    cd /home/postgres/database
    # dump roles separately, then the schema with one file per object (--split)
    /drbd/r0/opt/postgres-8.4.1/bin/pg_dumpall -r > gluepay.roles
    /usr/local/bin/pg_dump_split -f /home/postgres/database/gluepay -F p --split -s gluepay
    # stage changed and removed files as well as new ones, then commit and push
    /usr/bin/git checkout production
    /usr/bin/git add -u
    /usr/bin/git add .
    /usr/bin/git commit -m "Update of database schema `uname -a` `date -R`"
    /usr/bin/git push origin production

    If nothing has changed, nothing will be committed. Super simple.

    Thanks to having each version of the database in its own git branch, it's trivial to do a nice git diff between two branches.
    You can also do a diff between two revisions of your production database, to quickly check what has changed between day 1 and day 2.

    The pg_dump --split patch adds only 102 lines of code to pg_dump in total, including comments.

    http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php

    Hopefully we can convince Tom Lane it would be a good idea to add this feature to pg_dump. I think you have a lot of good points in your post. As I see it, there is simply no sane way to do this except this one, but it would require fixing pg_dump.

    I've been successfully using this method for two years, and we have a huge schema with changes every day; it works like a charm.

    On the subject of deployments, you might want to check out a simple function I wrote https://github.com/gluefinance/PgDeploy
    If you have ever done \df+ on a stored procedure in production before replacing it with a newer version, just to make sure the version you worked on in your repo was actually the same as the one in production, you will like this little tool.
    It allows you to get a proper diff between the function in the database and the new version you are deploying.
    I've been using this tool for more than 4 years without a single bug. It's just a few hundred lines of code.

    I would be more than happy to discuss this further with you; it sounds like we are both motivated enough to develop a nice standard mainstream solution to the problem.

    /Joel Jacobson

    Replies
    1. What do you do when all the developers have local databases? Aren't the dev/qa/prod databases on different branches? Do you merge prod changes into all the other branches?

  3. Forgot my email address, it's joel@ at rustly.com

    Replies
    1. Err, should be joel @ trustly.com. Feel free to drop me an email to discuss version control; I think using git together with the pg_dump --split option is a really nice way to do it. I've been using it for 2 years now, and it works like a charm.

  4. "dumps produced by pg_dump are ugly and optimized for restoring, not for developing with, and they don't have a deterministic output order."

    Solution to the problem:

    http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php

    I'm using git to version control the schema; it works great when all objects are written to separate files by pg_dump, thanks to this patch adding the --split option.

    Replies
    1. I mostly work with SQL Server, and I am "shocked" that PostgreSQL supports generating schema dumps with a standard command-line tool. In MSSQL you need to use complex programming APIs to achieve that.

  5. Yes, I am also searching for something usable for versioning not the API, but data and schema changes. Thank you for the effort of reviewing :)

  6. I disagree with the "Requiring running through all the upgrade scripts for a fresh deployment" point. For unit tests a simple create script is a better solution, but for production deployments, running through all the upgrade scripts is the only SAFE option. If you don't maintain that create script from early on, it's likely that it won't work for EVERY published database version. And that would be a disaster.
