Thursday, October 8, 2009

Rethink your text column indexing with PostgreSQL 8.4

The following item in the PostgreSQL 8.4 release notes hasn't gotten much attention:
  • xxx_pattern_ops indexes can now be used for simple equality comparisons, not only for LIKE (Tom) 
But this might significantly change how you approach indexing of text (and varchar and char (brrrh)) columns.

Let's review.  Consider a table like this:
CREATE TABLE persons (
    id int PRIMARY KEY,  -- hi Josh
    name text,
    otherdata ...
);
Since you occasionally want to look up a row by name, like
SELECT * FROM persons WHERE name = 'Smith';
you add an index like this:
CREATE INDEX persons_name_idx ON persons (name);
Then you decide that you also want to do wildcard searches like
SELECT * FROM persons WHERE name LIKE 'Smi%';
(or using POSIX regular expression; doesn't matter for this purpose). After some bemusement you discover and blindly accept that you need the following index to make that work:
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
And so it has become established practice, to some degree, to create two indexes on the interesting text fields: one for "normal" searches and one for pattern searches, with the ensuing hit on write performance.

Now what the mysterious release note item above says is that the pattern search index can now also be used for equality searches. So the index
CREATE INDEX persons_name_like_idx ON persons (name text_pattern_ops);
can be used to speed up the query
SELECT * FROM persons WHERE name = 'Smith';
(Try it out with EXPLAIN and enable_seqscan off in 8.3 and 8.4 to see the difference.)

So what do you need the "normal" index (persons_name_idx) for, then? Well, it will help you if you do range queries, like
SELECT * FROM persons WHERE name >= 'Smith' AND name <= 'Taylor';
But really, how often do you do range queries on text fields? Not very often. So here is the new plan. Next time you index a text field, think xxx_pattern_ops by default. It might be what you want more often than not. As a small bonus, I think the pattern_ops operator classes should also be slightly faster than the default ones, because they don't go through the full locale-enabled collation processing. And if you have been thinking two indexes so far, think only one index now. A great performance bonus there.

Here's a bummer: Let's say your SSSKA membership is up for renewal and you decide to do your table like this instead:
CREATE TABLE persons (
    name text PRIMARY KEY,
    otherdata ...
);
The primary key automatically creates an index using the default operator class, but as we have discovered now, we might want to have an index with a different operator class and only that one. The constraints only need to check for equality, so it shouldn't matter which of the operator classes it uses. But there is currently no way to specify an operator class for the index supporting primary key and unique constraints. Something to think about. OK, here is a completely evil way to do this: Edit the pg_opclass system catalog and switch the opcdefault setting around between the operator classes. Experiment at your own risk.

(Note: If you are using the C locale for lc_collate, this does not apply to you. Stick with the default operator class in that case.)

3 comments:

  1. The presentation you are linking to is really bad taste

    ReplyDelete
  2. "Bad taste" is an understatement! I wish I had seen the previous poster's comment prior to following that link!

    ReplyDelete
  3. Bad taste or not, the information in this blog post is great. Thank you for all of the analysis.

    ReplyDelete