Thursday, October 29, 2009

A History of Tarballs

I have been maintaining the autoconfigury of PostgreSQL for many years now, and every once in a while I go to ftp://ftp.gnu.org/gnu/autoconf/ to check out a new version of Autoconf. That FTP listing is actually an interesting tale of how tarball creation practices have evolved over the years.

Obviously, .tar.gz has been the standard all along. Some projects have now completely abandoned .tar.gz in favor of .tar.bz2, but those are rare. I think most ship both now. The FTP listing goes back to 1996; the first .tar.bz2 was shipped in 2001.

RPM-based distributions have switched to supporting and then requiring bzip2-compressed tarballs many years ago. Debian might start supporting that with the next release. So if you want to be able to trace your pristine tarballs throughout the popular Linux distributions, shipping both is best.

One thing that was really popular back then but is almost forgotten now is providing patches between versions, like autoconf-2.12-2.13.diff.gz. The Linux kernel still does that. Autoconf stopped doing that in 1999, when it was replaced by xdelta. Anyone remember that? This lasted until 2002 and was briefly revived in 2008. I think shipping xdeltas is also obsolete now except possibly for huge projects.

In 2003, they started signing releases. First with ASCII-armored signatures (.asc), now with binary signatures (.sig). The Linux kernel also does this, except they call the ASCII-armored signatures .sign.

In 2008, we saw the latest invention, LZMA-compressed tarballs (.tar.lzma). They appear to compress better than bzip2 by about as much as bzip2 wins over gzip. But, this one's already obsolete because it was replaced in 2009 by LZMA2, which goes by the file extension .tar.xz. Some "early adopters" such as Debian's packaging tool dpkg are in the process of adding xz support in addition to the short-lived lzma support.

Throughout all this, interestingly, tar hasn't changed a bit. Well, there are various incompatible extended tar formats around, but when this becomes a problem, people tend to revert to GNU tar.

GNU tar, by the way, supports all the above compression formats internally. gzip is -z, bzip2 is -j, lzma is, well, --lzma, and xz is -J.  And Automake supports creating all these different formats for source code distributions.

Saturday, October 24, 2009

Attention PL/Proxy Users: Hash Functions Have Changed in PostgreSQL 8.4

Consider the following elementary PL/Proxy example:
CREATE FUNCTION get_user_email(username text)
RETURNS text AS $$

    CLUSTER 'userdb';
    RUN ON hashtext(username);

$$ LANGUAGE plproxy;
The integrity of this setup depends on (among other things) the hash function always giving the same result for the same username. Otherwise your calls go to the wrong partition and you won't find your data again.

Unfortunately, the hashtext() function and other hash functions have changed their implementation between PostgreSQL 8.3 and 8.4. Observe:
8.3=> SELECT hashtext('foobar');
 hashtext
-----------
 504683490
(1 row)

8.4=> SELECT hashtext('foobar');
 hashtext
-----------
 289967942
(1 row)
So when you update your proxy database from 8.3 to 8.4, you will likely make all your data invisible and/or create a big mess.

Solution? Well, in the short run: don't update quite yet. If you're just starting or you have a small database, reload all your data through the proxy instance after upgrading. The best solution for now appears to be forward-porting 8.3's hash function to 8.4 as an add-on module. Eventually, it would probably be best if PL/Proxy itself provided a stable set of hash functions.

Thursday, October 8, 2009

Rethink your text column indexing with PostgreSQL 8.4

The following item in the PostgreSQL 8.4 release notes hasn't gotten much attention:
  • xxx_pattern_ops indexes can now be used for simple equality comparisons, not only for LIKE (Tom) 
But this might significantly change how you approach indexing of text (and varchar and char (brrrh)) columns.

Let's review.  Consider a table like this:
CREATE TABLE persons (
    id int PRIMARY KEY,  -- hi Josh
    name text,
    otherdata ...
);
Since you occasionally want to look up a row by name, like
SELECT * FROM persons WHERE name = 'Smith';
you add an index like this:
CREATE INDEX persons_name_idx ON persons (name);
Then you decide that you also want to do wildcard searches like
SELECT * FROM persons WHERE name LIKE 'Smi%';
(or using POSIX regular expression; doesn't matter for this purpose). After some bemusement you discover and blindly accept that you need the following index to make that work:
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
And so it has become established practice, to some degree, to create two indexes on the interesting text fields: one for "normal" searches and one for pattern searches, with the ensuing hit on write performance.

Now what the mysterious release note item above says is that the pattern search index can now also be used for equality searches. So the index
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
can be used to speed up the query
SELECT * FROM persons WHERE name = 'Smith';
(Try it out with EXPLAIN and enable_seqscan off in 8.3 and 8.4 to see the difference.)

So what do you need the "normal" index (persons_name_idx) for, then? Well, it will help you if you do range queries, like
SELECT * FROM persons WHERE name >= 'Smith' AND name <= 'Taylor';
But really, how often do you do range queries on text fields? Not very often. So here is the new plan. Next time you index a text field, think xxx_pattern_ops by default. It might be what you want more often than not. As a small bonus, I think the pattern_ops operator classes should also be slightly faster than the default ones, because they don't go through the full locale-enabled collation processing. And if you have been thinking two indexes so far, think only one index now. A great performance bonus there.

Here's a bummer: Let's say your SSSKA membership is up for renewal and you decide to do your table like this instead:
CREATE TABLE persons (
    name text PRIMARY KEY,
    otherdata ...
);
The primary key automatically creates an index using the default operator class, but as we have discovered now, we might want to have an index with a different operator class and only that one. The constraints only need to check for equality, so it shouldn't matter which of the operator classes it uses. But there is currently no way to specify an operator class for the index supporting primary key and unique constraints. Something to think about. OK, here is a completely evil way to do this: Edit the pg_opclass system catalog and switch the opcdefault setting around between the operator classes. Experiment at your own risk.

(Note: If you are using the C locale for lc_collate, this does not apply to you. Stick with the default operator class in that case.)

Tuesday, September 29, 2009

How to make LaTeX Beamer prettier

Having held a number of presentations at free software conferences and training events over the years, I have played with a bunch of presentation tools and keep coming back to LaTeX Beamer as the least of many evils, just as many others who go to these sorts of events appear to have done. Like most Beamer users, it appears, I had stuck with the default theme: Warsaw, blue. It's almost a trademark of Beamer by now.

Occasionally, however, I get somewhat jealous of the look created by Powerpoint, Impress, Keynote, or some other more evil tool. So finally I have collected some tips on how to make Beamer presentations less ordinary, but not silly, just better looking. Actually, the Beamer manual contains much of this information as well, so you can check that for the details, but the Beamer manual seems more targeted toward writing a mathematics lecture, not a technical presentation for an IT conference. So here it goes:
  • Create your own color theme. The standard blue is boring. Your company or organization probably has brand colors; use those. Normally, they are designed to look good together. (But if your primary color is supposed to be some shade of red, be careful. Sometimes, red slides look quite bad.) If you really don't have a color scheme, google for "color scheme creator" and make one.
  • When you implement your color theme, be sure to also set the colors for alerts and blocks, and anything else you might use. These are good places to apply secondary colors that your company's brand guidelines might supply.
  • Use a different style theme. Unless you are writing a semester-length lecture series (which you are probably not, if you are reading this), avoid all the themes that have always-on table of contents, progress bar, and other distracting stuff. Most of the themes don't look very good, in my opinion, but that is partially because of the funny colors. So the advice is, change the colors and theme in unison. It's not that hard; it took me a day to figure out by trial and error. Once you are done with this, save this as your or your organization's presentation template for the future.
  • Use \beamertemplatenavigationsymbolsempty to remove the navigation symbols from the PDF. No one needs them, they look weird, and they kind of reveal that you are using Beamer. Put that command in your theme file.
  • Change the font. Well, if you are going to a conference full of Windows users and want to show off your geek tool, leave the font. On the other hand, if you want your presentation to look more like Windows, that is, PowerPoint, use the Helvetica font. There are also other good fonts available that are neither Linux nor Windows biased. Also change the fonts for the verbatim environments. The same really goes for anything produced with LaTeX. Change the fonts. The defaults are from the 80s.
  • I strongly recommend the "upquote" package, which makes sure that an apostrophe in a verbatim environment (where you put your code samples) is upright like the ASCII apostrophe, not curly like a quotation mark. Besides looking better and more correct, this has the very practical benefit that copying and pasting code out of your PDF slides actually works correctly. You just have to include that package and everything works.
  • I advise against the "listings" package, especially for SQL code. It does not know nearly all the key words, it uses funny fonts, and looks pretty ugly. If some could fix that, I would be mildly interested. For other languages, for example Python, the results are much better, if you change the fonts and colors. Results may vary. But for SQL, the verbatim environment and a modern typewriter font work much better.
  • For diagrams, schemas, charts, and graphics of that sort, I recommend the "tikz" package. This was my key discovery of late. In the past, creating a flow chart or a deployment diagram meant using something like Dia or OpenOffice.org to construct the image (a painful experience in itself), then exporting that file in various formats until you find one that LaTeX can accept without butchering the image, and then scaling the image to fit on the page. And then if it doesn't look good or you want to change something, you repeat the cycle. With tikz, you issue drawing commands in TeX. Sounds scary, is scary, but it's not that hard. With the excellent documentation and tutorial, you can get the hang of it in a few days. This approach has awesome advantages: The graphics use the same fonts and colors as the rest of your document (the fonts and colors you customized, remember?). When you change the font family or size or a color, your graphics follow suit! All the graphics and fonts scale without problems. And everything is in one file; you don't have to worry about exporting and importing anymore. And you can use version control and diff and make and so on. And if the graphics have a repetitive nature and you are slightly daring, you can even program your graphics in TeX, with loops and automatic layout and all that.
That's it. Compare my SQL/MED presentations before and after I applied these tips. It's actually not so difficult once you figure out the right magic, as is typically the case with TeX things.

Tuesday, September 22, 2009

How to submit a patch by email

Submitting a patch by email isn't hard, but doing it well requires some attention.
  • Create the patch from the top level of the source tree. That means, when you are creating the patch, you should be in the directory that has configure, or in the directory above it, or at least you should make your patch look as if you were. (Git does that for you automatically.) In other words, the patch should apply with -p0 or -p1. Even if you are submitting a patch to PL/Perl, do it from the top level. Consistency is nice.
  • Add all new files into the patch. If you are using anonymous CVS, you are kind of out of luck with that. Switch to Git, or if you are daring, CVSup. Or if using plain diff, add the -N option, for example diff -c -N -r. Please don't send new files separately alongside with the patch. They should be in the patch.
  • I personally don't subscribe to the dogmatic insistence on the diff -c format. Sometimes -c works better, sometimes -u. Judge for yourself. Make the patch readable.
  • Don't compress the patch. This might be controversial, but I think unless your patch is huge, compression adds annoyance and saves little else. Because the mail reader at the other end probably won't be able to open the attachment directly, sending the reader through intermediate programs or on a side-trip to the console. Replying with the patch inline to comment probably won't work either. And the mailing list web archive isn't going to do anything useful with that kind of attachment. (It is sometimes submitted that compressing an attachment prevents mangling by the mail software. That is true, but if that is a problem, you should consider getting better mail software.)
  • If you have to compress, stick with gzip. Some mail and web software can process that usefully. If you go with bzip or something else, those chances are decreasing drastically.
  • Whatever you do, there is almost never a reason to wrap a patch into a tar archive. If you must compress it, gzip is enough. If you have to send more than one file, create multiple attachments.
  • I'm not sure if anyone cares about diffstats. I don't. Patches are not judged by how much code they add or remove or which places they touch.
  • Make sure you send your patch with a proper MIME Content-Type header. You might have to look a little harder into your mail client configuration to change this. It should be text/x-diff or text/x-patch or perhaps text/plain, although I think the latter is technically incorrect. But it should be text/something, so people can open the attachment with a text viewer easily. If the attachment has a weird MIME type, you force your readers to save the attachment to disk, then open it from there, which is just annoying. Also, you'll confuse the mailing list web archive in a similar way. If you compress your patch, make extra sure that the MIME type is correct. Don't send a gzip compressed file claiming it is a tar archive.
  • Give your patch a descriptive file name. Not "patch", but perhaps "hot-standby.patch". Make sure your mail client includes that file name into the description of the attachment (in Content-Disposition), so your readers can save the patch under the same name. Don't use absolute file names for this ("/tmp/mypatch").
  • If you are using Git, you may be tempted to rebase your patch into tiny pieces and submit those as N separate email messages. Don't do that. There is nothing wrong with separating a patch into smaller patches. But the standard for commiting is not the smallest possible change that doesn't break anything, it's the smallest possible change that we would want to release if you didn't finish the next one on time.
Those are just the issues I noticed browsing through the current commit fest. Please take a moment to check your patch submission practices and your email client. A good way to check is trying to open your own patch from your own email client, and trying to view the patch in the mailing list web archive.

Monday, September 21, 2009

How to set xterm titles in psql

As most of you might know, you can customize the prompts in psql, the PostgreSQL interactive terminal. If you didn't know, you can read up on it in the documentation. You can also add color to your prompt using VT100 control sequences; an example of that can be found in said documentation. I have my shell and psql prompts in different colors so I can tell quickly when I'm logged in to what. It spares me the embarrassment of typing "ls" into psql.

As it turns out, you can use that same mechanism to customize the xterm titles from within psql. It might not be completely clear why that is necessary. After all, if you have customized the xterm titles in your shell, say to show the name of the command currently running, then it would already show something like "psql somedb" when you are using psql. The trick, or perhaps danger, with that is that if you use commands like \c or SET SESSION AUTHORIZATION in psql, then your connection parameters change, but the shell doesn't update the xterm title. This had me confused more than once now.

The control sequence to change the xterm title is ESC]0;stringBEL, where ESC is \033 and BEL is \007. In psql prompt language, this becomes [%033]0;string%007%]. Add that somewhere to your prompt (doesn't really matter where, as it won't be printed). Say you are using the default prompt (boring!) and want to show "psql dbname" in the xterm window title when logged in to PostgreSQL, then put this into your ~/.psqlrc:
\set PROMPT1 '%[%033]0;psql %/%007%]%/%R%# '

Note: If you don't have your shell to set up to update the xterm title, then the title set by psql will remain after you quit psql. That might be even more confusing, so don't try this if you don't have your shell set up for this.

If you have other fun and/or useful ideas for the psql prompt, please share them in the comments.

Sunday, August 30, 2009

PL/XSLT

Let me introduce the procedural language that you didn't yet know you needed: PL/XSLT.

PL/XSLT isn't actually that new. I faintly recall that I wrote it around the time of the Anniversary Summit in 2006, but never published it. Today I found the source code again, cleaned it up a bit, and put it up on GitHub: http://github.com/petere/plxslt/

Here is roughly how this would look:
CREATE FUNCTION foo(xml) RETURNS xml AS $$
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xsl="http://www.w3.org/1999/XSL/Transform">
...
</xsl:stylesheet>
$$ LANGUAGE xslt;

There are some conditions on the function definition:

  • The first parameter must be of type xml. It will receive the input document. Every function must have at least this one parameter.

  • The following function parameters will be supplied to the stylesheet as XSL parameters.

  • The return type must match the output method specified by the stylesheet. If it is xml, then the return type must be xml; if it is text or html, the return type must be text or varchar.

  • Triggers are not supported.


In the source code, I include an example function that converts the output of the table_to_xml etc. functions to HTML.

Of course, the question will arise whether this is sensible. I guess it is about as sensible as considering XSLT a programming language. Judge for yourself. :-)

In the words of GitHub: fork the project and add to it.