Tuesday, April 1, 2014
PostgreSQL trash can
Wednesday, September 25, 2013
Design by committee
ALTER TABLE tbl OWNER TO something
but
ALTER TABLE tbl SET SCHEMA something
in PostgreSQL.
Maybe a committee faced with this inconsistency would arrive at the compromise
ALTER TABLE tbl [SET] {OWNER|SCHEMA} [TO] something?
Wednesday, August 28, 2013
Testing PostgreSQL extensions on Travis CI revisited
clkao/plv8js, which uses an environment variable matrix to control which version to use. This makes things much easier to manage and actually fires off parallel builds, so it's also faster. I've added this to all my repositories for PostgreSQL extensions now. (See some examples: pglibuuid, plxslt, pgvihash, pgpcre, plsh)
Automating patch review
I think there are two kinds of software development organizations (commercial or open source):
Those who don’t do code review.
Those who are struggling to keep up with code review.
PostgreSQL is firmly in the second category. We never finish commit fests on time, and lack of reviewer resources is frequently mentioned as one of the main reasons.
One way to address this problem is to recruit more reviewer resources. That has been tried; it’s difficult. The other way is to reduce the required reviewer resources. We can do this by automating things a little bit.
So I came up with a bag of tools that does the following:
Extract the patches from the commit fest into Git.
Run those patches through an automated test suite.
The first part is done by my script commitfest_branches. It extracts the email message ID for the latest patch version of each commit fest submission (either from the database or the RSS feed). From the message ID, it downloads the raw email message and extracts the actual patch file. Then that patch is applied to the Git repository in a separate branch. This might fail, in which case I report that back. At the end, I have a Git repository with one branch per commit fest patch submission. A copy of that Git repository is made available here: https://github.com/petere/postgresql-commitfest.
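In essence, each submission goes through a step like the following sketch. (fetch_raw_email and extract_patch are hypothetical stand-ins for the real helpers inside commitfest_branches, and the branch naming is made up.)

# Hypothetical sketch of the per-submission work done by commitfest_branches.
msgid=$1                                     # message ID of the latest patch version
branch=commitfest/$2                         # one branch per submission

fetch_raw_email "$msgid" > mail.raw          # download the raw email message (assumed helper)
extract_patch mail.raw > submission.patch    # pull the patch attachment out of it (assumed helper)

git checkout -b "$branch" master
if git apply --index submission.patch; then
    git commit -m "Apply patch from $msgid"
else
    echo "patch for $msgid does not apply" >&2   # report the failure back
fi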
The second part is done by my Jenkins instance, which I have written about before. It runs the same job as it runs with the normal Git master branch, but over all the branches created for the commit fest. At the end, you get a build report for each commit fest submission. See the results here: http://pgci.eisentraut.org/jenkins/view/PostgreSQL/job/postgresql_commitfest_world/. You’ll see that a number of patches had issues. Most were compiler warnings, a few had documentation build issues, a couple had genuine build failures. Several (older) patches failed to apply. Those don’t show up in Jenkins at all.
This is not tied to Jenkins, however. You can run any other build automation against that Git repository, too, of course.
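For example, a simple loop over the published branches could look roughly like this (the branch layout and build steps here are my assumptions, not part of the tooling):

# Rough sketch: build and test every branch of the published commit fest repository.
git clone https://github.com/petere/postgresql-commitfest.git
cd postgresql-commitfest
for branch in $(git for-each-ref --format='%(refname:short)' refs/remotes/origin | grep -v HEAD); do
    git checkout --quiet "$branch"
    ./configure --quiet && make --silent && make check || echo "FAILED: $branch"
done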
There are still some manual steps required. In particular, commitfest_branches needs to be run and the build reports need to be reported back manually. Fiddling with all those branches is error-prone. But overall, this is much less work than manually downloading and building all the patch submissions.
My goal is that by the time a reviewer gets to a patch, it is ensured that the patch applies, builds, and passes the tests. Then the reviewer can concentrate on validating the purpose of the patch and checking the architectural decisions.
What needs to happen next:
I’d like an easier way to post feedback. Given a message ID for the original patch submission, I need to fire off a reply email that properly attaches to the original thread. I don’t have an easy way to do that.
Those reply emails would then need to be registered in the commit fest application. Too much work.
There is another component to this work flow that I have not finalized: checking regularly whether the patches still apply against the master branch.
More automated tests need to be added. This is well understood and a much bigger problem.
In the meantime, I hope this is going to be useful. Let me know if you have suggestions, or send me pull requests on GitHub.
Wednesday, July 17, 2013
Testing PostgreSQL extensions on Travis CI
Tuesday, July 16, 2013
Tricky shell local variables
I have a word of warning against improper use of local in shell functions.
If you are using shell functions, you might want to declare some variables local to the shell function. That is good. The basic syntax for that is
local a b c
In some shells, you can also combine the local declaration and assignment, like this:
local foo=$1
local bar=$2
(The Debian policy even explicitly allows it.)
This is somewhat dangerous.
Bare shell assignment like
foo=$bar
does not perform word splitting, so the above is safe even if there are spaces in $bar. But the local command does perform word splitting (because it can take multiple arguments, as in the first example), so the seemingly similar
local foo=$bar
is not safe.
This can be really confusing when you add local to existing code and it starts breaking.
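Here is a minimal demonstration. (Exactly which shells split here varies; dash, for example, splits the unquoted form, while some bash versions treat assignments to declaration builtins specially.)

#!/bin/sh
f() {
    bar="two words"
    foo=$bar          # plain assignment: never word-split, always safe
    local baz=$bar    # in some shells this splits into: local baz=two words
    echo "foo=<$foo> baz=<$baz>"
}
f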
You can avoid this, of course, by always quoting everything, like
local foo="$bar"
but overquoting isn't always desirable, because it can make code less readable when commands are nested, like
local foo="$(otherfunc "other arg")"
(Nesting is legal and works fine in this case, however.)
I suggest using local only for declaring variables, and using separate assignment statements. That way, all assignments are parsed in the same way.
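That is, roughly:

f() {
    local foo bar                   # declaration only; nothing here can be word-split
    foo=$1
    bar=$(otherfunc "other arg")
}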
Tuesday, June 18, 2013
Autopex: PostgreSQL extension installation magic
You can simply run CREATE EXTENSION plproxy, and it will transparently download and build plproxy for you. (Actually, this only works if the extension name is the same as the package name. I'm planning to fix that.)
Note 1: You can't install Autopex via Pex, yet.
Note 2: I guess the next logical step would be Autoautopex, which installs Autopex and Pex automatically somehow. Patches welcome.
I suppose with logical replication, this might actually end up installing the extension code on the replication slaves as well. That would be pretty neat.
Wednesday, May 1, 2013
Moving to C++
GCC 4.8 was recently released. This is the first GCC release that is written in C++ instead of C. Which got me thinking ...
Would this make sense for PostgreSQL?
I think it's worth a closer look.
Much of GCC's job isn't actually that much different from PostgreSQL's. It parses language input, optimizes it, and produces some output. It doesn't have a storage layer; it just produces code that someone else runs. Also note that Clang and LLVM are written in C++. I think it would be fair to say that these folks are pretty well informed about selecting a programming language for their job.
It has become apparent to me that C is approaching a dead end. Microsoft isn't updating their compiler to C99, advising people to move to C++ instead. So as long as PostgreSQL (or any other project, for that matter) wants to support that compiler, they will be stuck on C89 forever. That's a long time. We have been carefully introducing the odd post-C89 feature, guarded by configure checks and #ifdefs, but that will either come to an end, or the range of compilers that actually get the full benefit of the code will become narrower and narrower.
C++ on the other hand is still a vibrant language. New standards come out and get adopted by compiler writers. You know how some people require Java 7 or Python 2.7 or Ruby 1.9 for their code? You wish you could have that sort of problem for your C code! With C++ you reasonably might.
I'm also sensing that at this point there are more C++ programmers than C programmers in the world. So using C++ might help grow the project better. (Under the same theory that supporting Windows natively would attract hordes of Windows programmers to the project, which probably did not happen.)
Moving to C++ wouldn't mean that you'd have to rewrite all your code as classes or that you'd have to enter template hell. You could initially consider a C++ compiler a pickier C compiler, and introduce new language features one by one, as you had done before.
Most things that C++ is picky about are things that a C programmer might appreciate anyway. For example, it refuses implicit conversions between void pointers and other pointers, or intermixing different enums. Actually, if you review various design discussions about the behavior of SQL-level types, functions, and type casts in PostgreSQL, PostgreSQL users and developers generally lean on the side of a strict type system. C++ appears to be much more in line with that thinking.
There are also a number of obvious areas where having the richer language and the richer standard library of C++ would simplify coding, reduce repetition, and avoid bugs: memory and string handling; container types such as lists and hash tables; fewer macros necessary; the node management in the backend screams class hierarchy; things like xlog numbers could be types with operators; careful use of function overloading could simplify some complicated internal APIs. There are more. Everyone probably has their own pet peeve here.
I was looking for evidence of this C++ conversion in the GCC source code, and it's not straightforward to find. As a random example, consider gimple.c. It looks like a normal C source file at first glance. It is named .c after all. But it actually uses C++ features (exercise for the reader to find them), and the build process compiles it using a C++ compiler.
LWN has an article about how GCC moved to C++.
Thoughts?
Tuesday, April 2, 2013
Installing multiple PostgreSQL versions on Homebrew
I was going to post this yesterday, but some might have thought that it was a joke. April 1st is always an annoying day to communicate real information.
If you have been fond of the way Debian and Ubuntu manage multiple PostgreSQL versions in parallel, you can now have the same on OS X with Homebrew:
brew tap petere/postgresql
brew install postgresql-9.2
# etc.
brew install --HEAD postgresql-common
postgresql-common is the same code as in Debian, only mangled a little. Now you have all the client programs symlinked through pg_wrapper, and you can use the server management tools such as:
pg_createcluster 9.2 main
pg_ctlcluster 9.2 main start
pg_lsclusters
Let me know if you find this useful.
Sunday, February 17, 2013
Lists and Tuples
(Yes, there are some details omitted here, such as that since a tuple is immutable, it is hashable and can be used as a dictionary key. But I think that is used fairly seldom.)
Then I came across Haskell and it dawned on me: Was this just a poorly mangled feature from Haskell? I don't know the history, but it looks a bit like it. You see, Haskell also has list and tuples. Lists are delimited by square brackets, and tuples are delimited by parentheses:
let alist = [1, 2, 3]
let atuple = (1, 2, 3)
(Technically, in Python, tuples are not delimited by parentheses, but they often appear that way.) But the difference is that Haskell does not use parentheses for any other purpose, such as delimiting function arguments. It uses spaces for that. (So it looks more like a shell script at times.)
Python: len([1, 2, 3])
Haskell: length [1, 2, 3]
But in Haskell, tuples are not mutable lists and lists are not mutable tuples. Tuples and lists are quite different but complementary things. A list can only contain elements of the same type. So you can have lists
[1, 2, 3, 4, 5]
["a", "b", "c", "d"]
but not
[1, 2, "a"]
A tuple, on the other hand, can contain values of different types:
(1, 2, "a")
(3, 4, "b")
A particular type combination in a tuple creates a new type on the fly, which becomes interesting when you embed tuples in a list. So you can have a list
[(1, 2, "a"), (3, 4, "b")]
but not
[(1, 2, "a"), (3, 4, 5)]
Because Haskell is statically typed, it can verify this at compile time.
If you think in terms of relational databases, the term tuple in particular makes a lot of sense in this way. A result set from a database query would be a list of tuples.
The arrival of the namedtuple also supports the notion that tuples should be thought of as combining several pieces of data of different natures, but of course this is not enforced in either tuples or named tuples.
Now, none of this is relevant to Python. Because of duck typing, a database query result set might as well be a list of lists or a tuple of tuples or something different altogether that emulates sequences. But I found it useful to understand where this syntax and terminology might have come from.
Looking at the newer classes set and frozenset, it might also help to sometimes think of a tuple as a frozenlist instead, because this is closer to the role it plays in Python.
Thursday, February 14, 2013
pgindent Jenkins job
The job produces a preview of what pgindent would do with the current source (pgindent.diff), which can be educational or terribly confusing.
Friday, February 1, 2013
Introducing the Pex package manager for PostgreSQL
The package manager is called pex. It's targeted at developers, allows easy customization, and supports multiple PostgreSQL installations.
Here is how it works:
Installation:
git clone git://github.com/petere/pex.git
cd pex
sudo make install
Install some packages:
pex init
pex install plproxy
pex search hash
pex install pghashlib
Multiple PostgreSQL installations:
pex -g /usr/local/pgsql2 install plproxy
pex -p 5433 install pghashlib
Upgrade:
pex update
pex upgrade
It works a bit like Homebrew, except that it doesn't use Ruby or a lot of metaphors. ;-)
Check it out at https://github.com/petere/pex.
Tuesday, January 1, 2013
PostgreSQL and Jenkins
So I have set this up now: http://pgci.eisentraut.org/jenkins/
It's already been very helpful during the last couple of weeks that I've run this. The main point behind the effort is to automate things. These are things I do just about every day and won't have to anymore:
- build PostgreSQL
- check for compiler warnings
- run various test suites
- do this for all supported branches
- check distribution building (make distcheck)
- test build of additional documentation formats
- cpluspluscheck
- check external web links in the documentation (The job for that currently appears to be reporting false positives. Use with caution.)
- test coverage reporting
Actually, many of the checks I had set up immediately found problems: newly introduced compiler warnings, secondary documentation format builds broken, cpluspluscheck failing, broken links in the HTML documentation, various extensions no longer build with PostgreSQL 9.3devel.
But there is more cool stuff:
- There are various RSS feeds for all builds or failed builds.
- You can interact with the system on mobile devices. I use JenkinsMobi for iOS.
- You can get up-to-date documentation builds on a more predictable schedule.
The one thing (just about) it doesn't do is test operating system and CPU architecture portability. Jenkins comes from a Java background, where this isn't much of an issue, and so there isn't good built-in support for that sort of thing. But anyway, we have the build farm for that.
You can get the code at http://bitbucket.org/petere/pgci. The entire setup is automated with Puppet. You can fork it and set up your own (or send me your changes), or you can run it locally using Vagrant (which is what I do to test changes).
If you have any ideas, let me know (file an issue on Bitbucket). I have plans for a number of enhancements already, foremost pg_upgrade testing. Also, let me know if there are additional extensions you want tested. I have just put in a few I use myself at the moment, but others can easily be added.
Happy New Year!
Monday, October 1, 2012
psqlrc files
psql supports version-specific .psqlrc files, such as .psqlrc-9.2. PostgreSQL 9.2 also added the "include relative" command \ir to psql. Combining these two, you can set up psql initialization to take advantage of any new features you want without breaking the use of old psql releases.
For example, I'd like to set up psql to automatically use \x auto. But if I just put that into .psqlrc, older psql releases will complain about an unknown command. (I usually have multiple PostgreSQL versions installed, and I share dotfiles across hosts.) On the other hand, I don't want to have to duplicate the entire .psqlrc file to add one command, which is where \ir comes in.
Here is what I use, for example:
.psqlrc-9.2
\ir .psqlrc
\set QUIET yes
\set COMP_KEYWORD_CASE preserve-lower
\x auto
\unset QUIET
.psqlrc-9.3
\ir .psqlrc-9.2
Tuesday, September 11, 2012
pgxnclient supports tarballs and HTTP
pgxnclient install http://pgfoundry.org/frs/download.php/3274/plproxy-2.4.tar.gz
This downloads, unpacks, builds, and installs. And the module doesn't need to be on PGXN. And of course you don't have to use HTTP; a file system location will work as well. I think this can be very useful, especially during development, when not everything is available in packaged form, or even for deployment, if you don't want to bother packaging everything and have been installing from source anyway.
Monday, August 13, 2012
Funny version numbers
A version number says a lot, after all. If I see
tool 2.4.1-2
then I (subconsciously) think, yeah, the upstream maintainers are obviously sane, the tool has been around for a while, they have made several major and minor releases, and what I'm using has seen about one round of bug fixing, and a minimal amount of tweaking by the Debian maintainer.
On the other hand, when I see
tool 7.0.50~6.5~rc2+0.20120405-1
I don't know what went on there. The original release version 7.0.50 was somehow wrong and had to be renamed 6.5? And then the 2nd release candidate of that? And then even that wasn't good enough, and some dated snapshot had to be taken?
Now, of course, there are often reasons for things like this, but it doesn't look good, and I felt it was getting out of hand a little bit.
I tried to look into this some more and find a regular expression for a reasonably sane version number. It's difficult. This is how far I've gotten: https://gist.github.com/3345974. But this still lists more than 1500 packages with funny version numbers. Which could be cause for slight concern.
Take a look at what this prints. You can't make some of that stuff up.
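As a rough illustration of the idea (my own simplified pattern, not the one in the gist): list installed packages whose versions stray from a plain dotted upstream version plus a small Debian revision.

# Simplified approximation of the check; the gist linked above is more thorough.
dpkg-query -W -f '${Package} ${Version}\n' \
    | grep -Ev ' [0-9]+(\.[0-9]+){0,3}(-[0-9]+(\.[0-9]+)*)?$'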
Wednesday, July 18, 2012
Tracing shell scripts with time stamps
With set -x you can turn on tracing, so that every command is printed before being executed.
In bash, you can also customize the output prefix by setting the PS4 variable. The default is PS4='+ '.
Here is an example. I wanted to "profile" a deployment script, to see why it took so long. Ordinarily, I might have sprinkled it with date calls. Instead, I merely added
set -x
PS4='+\t '
near the top. \t stands for time stamp. (The script was already using bash explicitly, as opposed to /bin/sh.) That way, every line is prefixed by a time stamp, and the logs could easily be analyzed to find a possible performance bottleneck.
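Put together, a trivial self-contained example looks like this; every traced line comes out prefixed with +HH:MM:SS:

#!/bin/bash
# Trace each command with a timestamp; \t in PS4 expands to the current time.
PS4='+\t '
set -x

slow_step() { sleep 2; }    # stand-in for a slow part of the deployment

echo "starting"
slow_step
echo "done"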
Sunday, May 20, 2012
Base backup compression options
I've been looking at my PostgreSQL base backups. They are run using the traditional
tar -c -z -f basebackup.tar.gz $PGDATA/...
way (many details omitted). I haven't gotten heavily into using pg_basebackup yet, but the following could apply there just as well.
I had found some of the base backups to be pretty slow, so I dug a little deeper. I was surprised to find that the job was completely CPU bound. The blocking factor was the gzip process. So it was worth thinking about other compression options. (The alternative is of course no compression, but that would waste a lot of space.)
There are two ways to approach this:
Use a faster compression method.
Parallelize the compression.
For a faster compression method, there is lzop, for example. GNU tar has support for that, by using --lzop instead of -z. It gives a pretty good speed improvement, but the compression results are of course worse.
For parallelizing compression, there are parallel (multithreaded) implementations of the well-known gzip and bzip2 compression methods, called pigz and pbzip2, respectively. You can hook these into GNU tar by using the -I option, something like -I pigz. Alternatively, put them into a pipe after tar, so that you can pass them some options. Because otherwise they will bring your system to a screeching halt! If you've never seen a system at a constant 1600% CPU for 10 minutes, try these.
If you have a regular service window or natural slow time at night or on weekends, these tools can be quite useful, because you might be able to cut down the time for your base backup from, say, 2 hours to 10 minutes. But if you need to be always on, you will probably want to qualify this a little, by reducing the number of CPUs used for this job. But it can still be pretty effective if you have many CPUs and want to dedicate a couple to the compression task for a while.
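For example (paths and the thread count are placeholders): you can either hand the compressor to GNU tar directly, or pipe through it so you can limit the number of threads.

# Let GNU tar call the parallel compressor itself:
tar -I pigz -c -f basebackup.tar.gz $PGDATA/...

# Or pipe explicitly, restricting pigz to 4 threads:
tar -c -f - $PGDATA/... | pigz -p 4 > basebackup.tar.gz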
Personally, I have settled on pigz as my standard weapon of choice now. It's much faster than pbzip2 and can easily beat single-threaded lzop. Also, it produces standard gzip output, of course, so you don't need to install special tools everywhere, and you can access the file with standard tools in a bind.
Also, consider all of this in the context of restoring. No matter how you take the backup, wouldn't it be nice to be able to restore a backup almost 8 or 16 or 32 times faster?
I have intentionally not included any benchmark numbers here, because it will obviously be pretty site-specific. But it should be easy to test for everyone, and the results should speak for themselves.
Tuesday, May 15, 2012
My (anti-)take on database schema version management
There were a number of posts recently about managing schema versions and schema deployment in PostgreSQL. I have analyzed these with great interest, but I have concluded that they are all more or less significantly flawed. (Of course, most of these solutions do in fact work for someone, but they are not general enough to become canonical go-to solutions for this problem class, in my opinion.) I have developed a list of elimination criteria by which I can evaluate future developments in this area. So here are some of the things that I don't want in my schema version management system:
Using schemas for distinguishing multiple versions (like this, but that's actually more about API versioning). That simply won't work for deploying objects that are not in schemas, such as casts, languages, extensions, and, well, schemas.
Using extensions (like this). Well, this could work. But extensions by themselves do nothing about the core problem. They are just an SQL wrapper interface around upgrade scripts. You still need to write the upgrade scripts, order them, test them, package them. The extension mechanism might replace the, say, shell script that would otherwise run the upgrade files in a suitable order. Another issue is that extensions require access to the server file system. Changing this is being discussed as "inline extensions", but there is no consensus. This is a smaller problem, but it needs to be thought about. Also, I do need to support PostgreSQL 9.0 and earlier for a little while more.
Requiring naming each change (patch names, like this). Naming things is hard. Numbering things is easy. And how many good names are you going to still be able to come up with after 100 or so changes?
Take a lesson from file version control systems: versions are numbers or, if it must be, hashes or the like (UUIDs have been suggested).
Using a version control tool for tracking upgrade paths (like this). Sqitch, unlike the initial draft of this concept, doesn't actually require a version control tool for deployment, which wouldn't have worked for me, because what we ship is a tarball or a deb/rpm-type package. But it still requires you to maintain some kind of sanity in your version control history so that the tool can make sense out of it. That sounds fragile and inconvenient. The other choice appears to be writing the plan files manually without any VCS involvement, but then this loses much of the apparent appeal of this tool, and it's really no better than the "naming each change" approach mentioned above.
Taking snapshots or the like of a production or staging or central development system. Production systems and staging systems should be off limits for this sort of thing. Central development systems don't exist, because with distributed version control, every developer has their own setups, branches, deployments, and world views.
You could set it up so that every developer gets their own test database, sets up the schema there, takes a dump, and checks that in. There are going to be problems with that, including that dumps produced by pg_dump are ugly and optimized for restoring, not for developing with, and they don't have a deterministic output order.
Storing the database source code in a different place or in a different manner than the rest of the source code. This includes using a version control system like mentioned above (meaning storing part of the information in the version control meta information rather than in the files that are checked into the version control system in the normal way), using a separate repository like Post Facto, or using something like the mentioned staging server.
The source is the source, and it must be possible to check out, edit, build, test, and deploy everything in a uniform and customary manner.
Allowing lattice-like dependencies between change sets (like most examples cited above). This sounds great on paper, especially if you want to support distributed development in branches. But then you can have conflicts, for example where two upgrades add a column to the same table. Depending on the upgrade path, you end up with different results. As your change graph grows, you will have an exploding number of possible upgrade paths that will need to be tested.
There needs to be an unambiguous, canonical state of the database schema for a given source checkout.
Requiring running through all the upgrade scripts for a fresh deployment (like this). There are two problems with this. First, it's probably going to be very slow. Fresh deployments need to be reasonably fast, because they will be needed for automated tests, including unit tests, where you don't want to wait for minutes to set up the basic schema. Second, it's inefficient. Over time, you might drop columns, add new columns, delete rows, migrate them to different tables, etc. If you run through all those upgrade scripts, then a supposedly fresh database will already contain a bunch of rubble, dropped attributes, dead rows, and the like.
Therefore, the current version needs to be deployable from a script that will not end up replaying history.
Using metalanguages or abstraction layers (like Pyrseas or Liquibase or any of the metaformats included in various web frameworks). It'd probably be a good idea to check some of these out for simple applications. But my concern is whether using an abstraction layer would prevent me from using certain features. For example, look at the Pyrseas feature matrix. It's pretty impressive. But it doesn't support extensions, roles, or grants. So (going by that list), I can't use it. (It's being worked on.) And in a previous version, when I looked at it for a previous project, it didn't support foreign-data functionality, so I couldn't use it then either. And those are just the top-level things the author thought of. Actually, the Pyrseas author has gone through some effort to have almost complete coverage of PostgreSQL DDL features, so give this tool a try. But it won't be for everyone.
So, I'm going to keep looking.
Monday, May 14, 2012
Time to retrain the fingers
For years, no, decades, I've typed tar tzf something, tar xzf something. Except when someone annoying sent an uncompressed tar file and I had to then go and take out the z in the middle.
Then came bzip2, and we learned tar tjf, tar xjf. OK, I could live with that. One emerging problem was that the tab completion now worked the wrong way around conceptually, because you had to pick and type the right letter first in order to see the appropriate set of files to unpack offered for completion.
Then came lzma, which was (quick, guess?) tar tJf, tar xJf. And then there was lzop, which was too boring to get its own letter, so you had to type out tar -x --lzop -f.
But lzma was short-lived, because then came xz, which was also J, because lzma was now too boring as well to get its own letter. Oh, and there is also the old compress, which is Z, and lzip, which I'd never heard of.
But stop that. Now there is
-a, --auto-compress
use archive suffix to determine the compression program
This handles all the above compression programs, and no compression. So from now on, I always use tar taf and tar xaf. Awesome.
The finger movements will be almost the same on QWERTY and AZERTY, and easier than before on QWERTZ.
Actually, this option is already four years old in GNU tar. Funny I'd never heard of it until recently.