Saturday, March 13, 2010

Running SQL scripts with psql

If you are using PostgreSQL, have you ever loaded an SQL script file or dump file through psql? Well, duh.

-f vs. <

If you are just starting out, you will probably try this:
psql mydb < dump.sql
and that's quite OK.

Once you hit your first error message such as
ERROR:  syntax error at or near "VEIW"
you might figure out that for some reason
psql mydb -f dump.sql
is better, because it produces
psql:dump.sql:56: ERROR:  syntax error at or near "VEIW"
instead, allowing you to actually find the error in your file.

Now I admit that it is almost entirely me who is to blame for this bizarre difference, because at some point in the distant past, the GNU Coding Standards recommended that programs should behave the same independent of whether the standard input or output is a terminal or a file. The current version of said standard actually explicitly creates an exception saying that error messages should be changed to the noninteractive style when the standard input is not from a terminal. So this should probably be fixed.

Note that the -f form above is not portable. It depends on the GNU getopt extension that permits options after nonoption arguments. To be portable, you need to write either
psql -d mydb -f dump.sql
or
psql -f dump.sql mydb
Frankly, I hardly ever do this because I rarely use a non-GNU system, but keep it in mind when writing scripts or documentation intended to be portable.

psqlrc

The next thing you should always do when running psql scripts is using the option -X, which prevents the reading of the .psqlrc file. Because that file could contain anything, and you have no idea how it will interact with your script.

In my mind, this is a design mistake in psql. Unix shells have different startup files for interactive and noninteractive usage, so they don't have this problem.

Quiet or Loud

Anyway, if you are restoring a dump, these commands will produce output that does something like this:
CREATE TABLE
CREATE TABLE
CREATE TABLE
CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION
ALTER TABLE
ALTER TABLE
CREATE INDEX
CREATE INDEX
These are the responses from the DDL commands that are in your SQL script file. I personally like to not see these, because they don't tell me anything useful and they cause the important information such as error messages to fly rapidly off the screen.

Some people might like these responses because they serve as a kind of progress indicator. Which is true, but there is a flaw. Creating tables and functions is pretty fast. Chances are that that part of the dump file will fly way off your screen in two seconds. The slow part of a database restore are the COPY commands that restore the data. And those do not produce any response at all! So as a progress report, this output is not all that useful. OK, the ALTER TABLE parts at the end that create the foreign keys can be slow, and of course the CREATE INDEX commands, but in the above example, the longest wait would be after the last CREATE FUNCTION.

There are two ways to make this behavior more sane: If you don't want the "progress" output, only the errors and other important messages, use the option -q (or --quiet). This is probably more useful for restoring a dump on the console. If you want a full progress report, use the option -a (or --all), which will in addition to the default behavior print each statement from the file before executing it. This is probably fairly useful in batch jobs where you are logging the output to a file or somewhere. But most likely you will want to use one of these two options for almost any noninteractive invocation of psql. You can also somewhat usefully use both options at the same
time; think about it.

Notices

On the matter of controlling the output, you might want to hide the NOTICE messages such as
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test1_pkey" for table "test1"
These get pretty boring quickly, especially when you create a lot of tables with a lot of primary keys.

psql doesn't provide a direct command-line option for turning this off, but there is an obscure method: See the environment variable PGOPTIONS to '--client-min-messages=warning', which will pass that option to the server process. So all together, your call might look like
PGOPTIONS='--client-min-messages=warning' psql -d mydb -q -f dump.sql

For some tastes, it might actually make sense to change the client_min_messages option permanently in postgresql.conf on the server.

Also note that analogous to the case of .psqlrc described above, PGOPTIONS could contain all kinds of settings that you don't want. So even if you don't want to disable notices as shown here, it would be most prudent to unset the environment variable in other cases.

Transactions

The next thing you might want to consider is using transactions, or rather a single transaction. Otherwise, if the execution of the script or restoration of the backup fails somewhere along the way, you have a half-loaded database and no easy restart point. Sometimes this doesn't matter that much; if you are restoring into a new database, you can just drop it and start over. If you are deploying updates into an existing system, much more care is required. The option to execute a file in a single transaction is -1 or --single-transaction.

Now depending on what exactly is in the file you want to load, using a single transaction may or may not work very well. For dump files created with pg_dump, it usually works, unless you used the -c/--clean option, in which case the DROP commands that appear at the beginning of the file might not have anything to drop and will fail, failing the entire transaction. That problem could be addressed if pg_dump used DROP IF EXISTS, but it doesn't.

When deploying schema changes into an existing database, this can get very tricky. For functions, you can usually use CREATE OR REPLACE FUNCTION to have an idempotent function creation command. Except that it will fail when the return type was changed. And there is no CREATE OR REPLACE AGGREGATE or CREATE OR REPLACE OPERATOR (yet?). With other object classes such as types or casts it can be easier to just unconditionally drop and recreate the objects. If you drop a type, however, everything that uses that type is also dropped, so be sure to recreate everything (such as functions) afterwards in the same transaction. This won't help when the type is used in tables, though; don't blow away your tables. Exactly how to manage this type of situation is a science of its own and would go beyond the scope of this post.

Side note: An alternative for some situations is using the psql option ON_ERROR_ROLLBACK, which allows you to ignore errors but still make use the atomicity property of transactions, useful when doing schema upgrades.

If you are convinced that running SQL scripts in transactions is the right thing, and you have adjusted your scripts to behave properly in that context, then you may also wish to consider the option
-v ON_ERROR_STOP=1
This causes psql to stop execution when a transaction fails. Otherwise it would continue to execute the rest of the script and issuing the error message
ERROR:  current transaction is aborted, commands ignored until end of transaction block
for every command, which makes no sense. There was a discussion a while ago about making ON_ERROR_STOP default to on when the single-transaction option is used, but it was not clear how this should interact with savepoints. If you are making use of savepoints, you may want to stay away from this option or evaluate yourself whether it makes sense for you.

Pager

If you run a psql script with the output on a terminal (which is normal when installing databases, restoring backups, etc.), the script executes a SELECT command, and the output doesn't fit on the screen, the output is run through the pager, which will normally wait for some key to be pressed to continue. This is obviously an extremely useful feature in psql in interactive mode, but it also happens when you run a script, which is dubious.

Often, this won't be a problem, because backup files for instance don't contain SELECT commands with large outputs. But you can simulate this by taking any dump file that restores a sequence, which will contain SELECT setval(...) calls. If you make your terminal window 4 lines or less, you can see the effect of this. Again, this is contrived, but every so often someone puts a SELECT command in a file to create some kind of visual verification that the tables or the functions or the data that the script was supposed to load is now really there. There is nothing wrong with that, except when you run the script from the terminal and forget to check back with extreme frequency. The old joke that someone left a transaction open before going to lunch has never been more real.

Add the option
--pset pager=off
to the psql invocation to disable this behavior.

Summary

OK, got all that? Write this down, here is how you really should execute an SQL script file in psql:

PGOPTIONS='--client-min-messages=warning' psql -X -q -a -1 -v ON_ERROR_STOP=1 --pset pager=off -d mydb -f dump.sql

Should be easier? I think so. Discuss.

Wednesday, March 10, 2010

Looking for Free Hosting

I'm looking for a way to do free hosting. But I mean free as in freedom, not free as in beer. Let me explain.

When I'm using a piece of free and open-source software such as OpenOffice.org, Evolution, or anything else, I have certain possibilities, freedoms if you will, of interacting with the software beyond just consuming it. I can look at the source code to study how it works. I can rebuild it to have a higher degree of confidence that I'm actually running that code. I can fix a bug or create an enhancement. I can send the patch upstream and wait for the next release, or in important cases I can create a local build. With the emerge of new project hosting sites such as GitHub, it's getting even easier to share one's modifications so others can use them. And so on.

As a lot of software moves to the web, how will this work in the future? There are those that say that it won't, and that it will be a big problem, and that's why you shouldn't use such services. Which is what probably a lot of free-software conscious users are doing right now. But I think that in the longer run, resisting isn't going to win over the masses to free software.

First of all, of course, the software would need to be written. So a free web office suite, a free web mail suite that matches the capabilities of the leading nonfree provider, and so on. We have good starts with Identi.ca and OpenStreetMap, for example, but we'd need a lot more. Then you throw it on a machine, and people can use it. Now as a user of this service, how do I get the source code? Of course you could offer a tarball for download, and that is the approach that the AGPL license takes. One problem with that is, if you are used to apt-get source or something similar for getting the source, everyone putting a tarball on their web site in a different place isn't going to make you happy. A standardized packaging-type thing ought to be wrapped around that. Another problem is that even if you trust the site's operator that that's the source code that's actually running on your site (even without malice, it could for example be outdated against the deployed version), it probably won't contain the local configuration files and setup scripts that would allow me to duplicate the service. And if I just want to study how the program is running in actuality, there is not much I can do.

Giving everyone SSH access to the box is probably not a good idea, and won't really solve all the issues anyway. In the future, when virtualization is standardized, ubiquitous, and awesome, one might imagine that a packaging of a web service won't be "put these files on the file system and reload a daemon" but instead "put these files and this configuration on that machine and activate it". This might give rise to a new generation of Linux "distributors". Getting the source tarball or "source package" might then involve getting a snapshot of that image, which you can examine, modify, and redeploy elsewhere. That could work for OpenStreetMap, for example, modulo the space and time required for their massive database. (But you might chose to fork only the code, not the data.) But it won't be easy to do the right thing in many cases, because with a web service, there is usually other people's data on the machine as well, which would need to be masked out or something. Maybe this really can't be done correctly, and the future will be more distributed, like in the way Jabber attempted to supplant centralized services such as ICQ. Distributed web mail makes sense, distributed OpenStreetMap perhaps less so.

Ideas anyone? Does anyone perhaps have experiences with running a web service that attempts to give users the freedoms and practical benefits that are usually associated with locally installed software?

Friday, February 5, 2010

My Favorite PostgreSQL 9.0 Feature

My favorite PostgreSQL 9.0 feature does not have a two-letter acronym. It's the new bytea format, available since 8.5alpha1.

At F-Secure Labs, as you might imagine, we store information about a bunch of malware samples. Throughout the computer security industry, file samples, malware or perhaps not, are referred to by a hash value, such as MD5, SHA1, or SHA256. The typical representation of such hash values in most programming environments and also in prose and literature is the hexadecimal format, for example da39a3ee5e6b4b0d3255bfef95601890afd80709. Except when you want to write a test case against PostgreSQL or want to track down a problem, you'll be looking for \3329\243\356^kK\0152U\277\357\225`\030\220\257\330\007\011, also known as \\3329\\243\\356^kK\\0152U\\277\\357\\225`\\030\\220\\257\\330\\007\\011 in some contexts.

Well, that's over now; it will show as \xda39a3ee5e6b4b0d3255bfef95601890afd80709. You will still need to take care of the backslash, but that will surely be resolved when standard_conforming_strings is turned on in version 10.0 ;-), or we implement a new, SQL-standard conforming binary string type without legacy syntax issues.

By the way, the actual origin of this feature idea was a performance problem, reported by Bernd Helmle. The new format is quite a bit faster to encode and decode: In some internal tests, pg_dump of tables with mostly binary data was twice as fast and created a dump file that was half the size with the new format compared to the old format. So hopefully everyone wins.

Another new feature in the same area, by the way, is that PL/Python now supports bytea values sanely, contributed by Caleb Welton.

Sunday, January 31, 2010

Going ...

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting

See you there! Or maybe even there.

Wait ... I have the last slot on Saturday and the first slot on Sunday?!? Great! :^)

Monday, January 25, 2010

PostgreSQL: The Universal Database Management System

I'm glad you asked, since I've been pondering this for a while.  $subject is my new project slogan.  Now I'm not sure whether we can actually use it, because a) it's stolen from Debian, and b) another (commercial, proprietary) database product already uses the "universal database" line.

I have come to appreciate that the "universality" of a software proposition can be a killer feature.  For example, Debian GNU/Linux, the "universal operating system", might not be the operating system that is the easiest to approach or use, but once you get to know it, the fact that it works well and the same way on server, desktop, and embedded ensures that you never have to worry about what operating system to use for a particular task.  Or Python, it's perhaps not the most geeky nor the most enterprisy programming language, but you can use it for servers, GUIs, scripting, system administration, like few other languages.  It might as well be the "universal programming language".  A lot of other software is not nearly universal, which means that whenever you move into a new area, you have to learn a lot of things from scratch and cannot easily apply and extend past experiences.  And often the results are then poor and expensive.

The nice thing about PostgreSQL is that you never have to worry about whether to use it, because you can be pretty sure that it will fit the job.  Even if you don't care whether something is "open source" or "most advanced".  But it will fit the job.  The only well-known exception is embedded databases, and frankly I think we should try to address that.

Tuesday, January 12, 2010

Procedural Languages in PostgreSQL 8.5: The One That Works!

While much of the PostgreSQL hacker world is abuzz over two-letter acronyms (HS, SR, VF), I will second Andrew's post and will generalize this to say, partially tooting my own horn, of course, that the next PostgreSQL release will be a great one for procedural languages. Behold:
  • PL/pgSQL is installed by default.
  • New DO statement allows ad hoc execution of PL code.
  • PL/pgSQL finally got a sane parser.
  • PL/Perl got a shot in the arm.
  • PL/Python got saner data type handling, Unicode support, and Py3k support.
  • Not directly related, but the coming PL/Proxy features are looking promising as well.
  • (Meanwhile, language historians will be interested to know that PL/Tcl has received exactly zero feature or bug-fix commits since 8.4.)
This will be a great boost for PostgreSQL the development platform.

Monday, January 4, 2010

Remove and Purge

Debian's package manager dpkg has the perhaps unique feature that it distinguishes between removing and purging a package. Removing it removes the program files but keeps the configuration files (and sometimes the logs) around, purging it really removes everything. While this distinction undoubtedly has some uses, I have found that I almost never make use of it. I think in about six years of using Debian I have actually needed a remove-but-not-purge functionality about five times, during some really tricky upgrades (and using Aptitude instead of APT might have helped, not sure) and once when I wanted to build a package that had a build dependency that conflicted with a package I had installed (cowbuilder came later).

I think many people don't fully realize this distinction, and thus aged systems will often contain dozens or hundreds of removed-but-not-purged packages lying around. Great fun cleaning that up. And therefore, at some point in the distant past I have switched all my APTs to purge by default, using the configuration setting Apt::Get::Purge "true";. At the time I thought this would be daring, but I have never looked back. The one time a year that I don't want to purge I override this by hand.

Later, APT actually got an apt-get purge command, but there is no apt-get autopurge and no apt-get dist-upgrade-and-purge (or whatever) to purge the packages it wants to remove. This can be worked around by carefully adding --purge to all invocations of apt-get, but who will remember that. And of course apt-get remove is hardwired into my fingers.

How do other people handle this? Are there undiscovered reasons removing is the better default? How do you clean up packages that were forgotten to be purged?