Tuesday, May 11, 2010

MERGE Syntax

The SQL MERGE statement has gotten my attention again. For many years, PostgreSQL users have been longing for a way to do an "upsert" operation, meaning do an UPDATE, and if no record was found do an INSERT (or the other way around). Especially MySQL users are familiar with the REPLACE statement and the INSERT ... ON DUPLICATE KEY UPDATE statement, which are two variant ways to attempt to solve that problem (that have interesting issues of their own). Of course, you can achieve this in PostgreSQL with some programming, but the solutions tend to be specific to the situation, and they tend to be lengthier than one would want.

Discussions on this then usually proceed to speculate that the SQL-standard MERGE statement ought to be the proper way to solve this, and then it turns out that no one completely understands the MERGE syntax or semantics, especially as they apply to this upsert problem. (I was in that group.) And that's been the end of that so far. OK, as I write this I am pointed, via Robert Haas's blog post, to an older mailing list post by Simon Riggs, who is surely one of the men most qualified to drive an eventual implementation, that contains a hint toward the solution, but it's hard to find in that post, if you want to try.

This subject had gotten my attention again at the SQL standard working group meeting I attended a few weeks ago, where I learned that in SQL:2011, a DELETE branch has been added to MERGE. We also took some time after the official part of the meeting to work through some examples that illustrate the uses of the MERGE statement.

Let's take a look at what the MERGE statement is originally supposed to do, and where the term "merge" arose from. Let's say you have a table with outstanding balances, such as
CREATE TABLE balances (
    name text,
    balance numeric
);
and at intervals you get a list of payments that your organization has received, such as
CREATE TABLE payments (
    name text,
    payment numeric
);
What you want to do then is to "merge" the payments table into the balances table in the following way:
  • If a balance exists, subtract from it.
  • If the balance goes to zero, delete the entire record.
  • If no balance exists, create a record (maybe someone pre-paid).
The command to do this would be:
MERGE INTO balances AS b
    USING payments AS p
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -b.payment);
Of course there are simpler cases, but this shows all of the interesting features of this command.

How does this get us upsert? There, you don't have two tables, but only one and some values. I have seen some claims and examples about this in the wild that turn out to be wrong because they evidently violate the syntax rules of the SQL standard. So I did the only sensible thing and implemented the MERGE syntax into the PostgreSQL parser on the flight back, because that seemed to be the best way to verify the syntax. So the correct way, I believe, to do, say, an upsert of the balances table would be:

MERGE INTO balances AS b
    USING (VALUES ('foo', 10.00), ('bar', 20.00)) AS p (name, payment)
    ON p.name = b.name
    WHEN MATCHED AND b.balance - p.payment = 0 THEN DELETE
    WHEN MATCHED AND b.balance - p.payment <> 0 THEN UPDATE SET balance = balance - p.payment
    WHEN NOT MATCHED THEN INSERT (name, balance) VALUES (p.name, -b.payment);
Not all that nice and compact, but that's how it works.

Note that the AS clause after VALUES is required. If you leave it off, the PostgreSQL parser complains that a subquery in FROM needs an AS clause. Which is obviously not what this is, but it uses the same grammar rules, and it makes sense in this case because you need a correlation name to join against. And it was also one of those rare moments when you implemented something that gives you correct feedback that you didn't even provide for.

Anyway, the examples above all parse correctly, but they don't do anything yet. But if someone wants to implement this further or just try out the syntax, I'll send my code.

12 comments:

  1. I like it! Save the fact that it doesn't do anything, the syntax is better than Oracle's because DELETE is in its own branch. With Oracle, delete is appended to the UPDATE branch so you can't delete with out first updating something.

    ReplyDelete
  2. There's no "WHEN MATCHED THEN DO NOTHING" or sth?

    ReplyDelete
  3. > Especially MySQL users are familiar with the REPLACE
    > statement and the INSERT ... ON DUPLICATE KEY UPDATE
    > statement, which are two variant ways to attempt to
    > solve that problem (that have interesting issues of
    > their own).

    What exactly is your criticism of MySQL's approach (besides inventing their own syntax)?

    Is it the fact that they require a unique constraint? As far as I can tell, the only way to implement this atomically is indeed a unique constraint. The fact that MySQL's approach handles race conditions for you, without (relatively) expensive SAVEPOINTs, is a big plus in my opinion. The ability to restrict it only to a specific constraint would be nice of course.

    With the MERGE syntax I really can't see how that could be done without a retry loop at some level.

    PS: your blog doesn't let me preview or post with Firefox, quite annoying to have to re-type the whole thing!

    ReplyDelete
  4. This specific approach seems quite nice, and in general I find this that taking the syntax in direction is very good. After going from working with Oracle to PostgreSQL there were things I missed, but they're getting ticked off one by one it seems. I've got a few peeves about SQL too, but this moves towards improving things. I've ranted on occasion about how I don't like providing a set of isolated commands for changing things but would rather feed an "input set" into a command, sort of the reverse of getting a "result set" from a select. The approach of having to do a "first do one command and if it fails do another" is just the most obvious negative symptom of there being room for a fundamentally better way. I feel that such a unified approach is "more relational" than having a lot of commands that perform isolated units of work, since there's always some other way you want to combine data, even on input.

    ReplyDelete
  5. Perhaps I thought it was obvious, but when you don't want to do anything for the MATCHED case, then you leave that clause off. There is no way to "explicitly" do nothing.

    ReplyDelete
  6. I linked to Simon's good commentary and the rest of the better historical bits here at http://wiki.postgresql.org/wiki/SQL_MERGE last month; those used to be on the TODO list until I broke them out as a space to better describe them.

    Thanks for the syntax work, Simon is still considering doing the internals needed to make it work. Every bit someone works on here helps.

    ReplyDelete
  7. scottrbaileym,

    in Oracle you can do "when matched then update set myColumn=null WHERE 1=2".
    That will "DO NOTHING" on all rows.

    ReplyDelete
  8. Nice. It seems like this could make for a quick-and-easy materialized view in a number of simple cases...

    I'm guessing that cost would scale mostly with the size of p in this case, and that you'd want an index on b.name?

    ReplyDelete
  9. @intgr

    The MySQL REPLACE command has various funny behaviors in corner cases, which is presumably why it has been superseded by the INSERT ODKU command.

    ReplyDelete
  10. The code is now available here:

    http://git.postgresql.org/gitweb?p=users/petere/postgresql.git;a=shortlog;h=refs/heads/merge-statement

    ReplyDelete
  11. Peter, I wonder what happened to this branch? Do you have any plans to merge it into mainline Postgres, like in 9.3 or maybe later?

    ReplyDelete
  12. Well version 9.5 solved this issue http://stackoverflow.com/questions/17267417/how-do-i-do-an-upsert-merge-insert-on-duplicate-update-in-postgresql

    ReplyDelete