Wednesday, July 8, 2009

Schema Search Paths Considered Pain in the Butt


OK, I've had it. Who ever thought that a search path was a good idea anyway? I'm talking about the schema search path in PostgreSQL in particular, but any other ones are just as bad. If you review the last 48 hours in the PostgreSQL commit log, you can see what kind of nonsense this creates, and what amount of resources it drains (with apologies to my fellow developers who had to clean up my mistakes).

The Schema Search Path

What is the schema search path? One way to describe it is that it allows you to specify in which schemas objects can be addressed without schema qualification, and in which order those schemas are searched when a name is resolved. So if you write
CREATE TABLE schema1.foo ( ... );
CREATE TABLE schema2.bar ( ... );
SET search_path TO schema1, schema2;
then you can address these tables as just
SELECT ... FROM foo, bar ...
instead of
SELECT ... FROM schema1.foo, schema2.bar ...
Another way to describe it is that it allows you to set hidden traps that mysteriously make your perfectly good SQL behave in completely unpredictable ways. So a perfectly harmless query like
SELECT * FROM pg_class WHERE relname ~ 'stuff$';
can be changed to do pretty much anything, if the user creates his own pg_class and puts it first into the search path, before the system schema pg_catalog.
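For concreteness, such a trap could be set up like this (a sketch; the schema name evil and the table contents are invented here):
CREATE SCHEMA evil;
CREATE TABLE evil.pg_class (relname name);
INSERT INTO evil.pg_class VALUES ('innocent-looking stuff');
SET search_path = evil, pg_catalog;  -- pg_catalog is now searched *after* evil
SELECT * FROM pg_class WHERE relname ~ 'stuff$';  -- reads evil.pg_class, not the system catalog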

The way to deal with this, to the extent that it is dealt with at all, is either to set the schema search path everywhere before accessing any object (for some value of "everywhere", such as every script, every session, every function), or to explicitly schema-qualify every object reference. The problem with the former is that it negates the point of letting the user set the search path in the first place. What you would need to do instead is prepend your known search path to whatever was set already, which is enough of a pain to program in SQL that probably hardly anyone does it that way. The problem with the latter is that it makes schemas pointless: if you have to qualify everything, you might as well make the schema part of the name ("pg_catalog_pg_class") and forget about the extra indirection.
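For what it's worth, the prepending dance might look roughly like this in plain SQL (a naive sketch using the standard current_setting and set_config functions; the schema name myapp is invented, and it handles neither quoting nor duplicate entries):
SELECT set_config('search_path',
                  'myapp, ' || current_setting('search_path'),
                  false);
And even that still has to run at the start of every session before anything else touches the database.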

Qualify Everything!

The way we usually deal with this in PostgreSQL core code, for example in the psql describe code, is to add pg_catalog qualifications everywhere; see this last commit for instance. In my small example above, this would become
SELECT * FROM pg_catalog.pg_class WHERE relname ~ 'stuff$';
But this, the above commit, and all of psql are still wrong, because the actually correct way to write this is
SELECT * FROM pg_catalog.pg_class WHERE relname OPERATOR(pg_catalog.~) 'stuff$';
because PostgreSQL of course allows user-defined operators, and those are subject to search path rules like anything else. Proof:
CREATE OPERATOR = ( PROCEDURE = charne, LEFTARG = "char", RIGHTARG = "char" );
SET search_path = public, pg_catalog;
\diS
This command is supposed to show all indexes, but now it will show everything that is not an index. Try this yourself; it's a great way to mess with your colleagues' heads. :-) Anyway, surely no one wants to schema-qualify all operators, which is why attempting to qualify everything is probably a hopeless proposition.
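(In case you wonder why the \diS trick works: psql's describe commands filter on pg_class.relkind with a bare comparison, roughly along these lines — a simplified sketch, not the exact query psql sends:
SELECT c.relname
FROM pg_catalog.pg_class c
WHERE c.relkind = 'i';  -- the bare = now resolves to the charne operator created above,
                        -- so this matches every relation that is *not* an index
)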

Unix Shells

Let's look for some analogies to this dreaded technology. Unix shells have had search paths for decades. I don't have a whole lot of insight into their history, or what they were originally intended for, but the way I see it now, from my own vantage point as an operating system development contributor, they are just as much a pain in the butt in their full generality. Because users can set hidden traps that can make any old harmless command do anything, various programming and security pains persist. The usual answers are again: set the path explicitly in every shell script, or add an explicit path to every command. The latter is obviously rarely done, and would, by the way, possibly fall into the same operator trap that I mentioned above. Who knew that [ (as in if [ "$foo" = bar ]) might actually be a separate executable:
$ ls -l '/usr/bin/['
-rwxr-xr-x 1 root root 38608 2009-06-05 03:17 /usr/bin/[

Many authors of shell scripts probably know that setting an explicit path at the top of the script is a good idea. Or is it? On this Debian system, I have 242 shell scripts in my /usr/bin/, 181 of which don't appear to set a path. 14 out of 28 scripts in /etc/init.d/ don't set a path. And among the 14 that do, there are 12 different variants of the path that they actually choose to set. Which makes any sensible use of the path impossible. One such use might be to put a locally modified version of a common program into /usr/local/bin/. Now some scripts will use it, some will not.

Linux distributions have been trying to move away from relying on this path business anyway. The Filesystem Hierarchy Standard only allows for a few directories for executable programs. Everything else is supposed to be organized by the package manager, and in practice perhaps by the occasional symlink. Solaris has a much greater mess of bin directories, such as /usr/gnu/bin and /usr/xpg4/bin, but even there I sense that they don't want to keep going in that direction.

Whither PostgreSQL?

In PostgreSQL, you might get the worst of all worlds:
  • Explicitly qualifying everything is completely unwieldy, and perhaps impossible if you can't modify the SQL directly.

  • Programmatically modifying the search path in a correct way is pretty complicated.

  • There is no standard for what goes into what schema.

  • There is talk of adding more "standard schemas", potentially breaking all those applications that managed to get one of the above points right.

  • Using the search path to override a system object with your own implementation doesn't work, at least not in half of the applications that choose to do the supposedly right thing and alter the search path to their liking.

  • Security definer functions (analogous to setuid scripts) don't have a safe default for the search path. You need to make sure you set a safe path in every one of them yourself (see also CVE-2007-2138). One wonders how many of those out there are broken in this regard. At least setuid shell scripts are usually not possible.
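
One way to handle that last point, since 8.3, is to pin the path in the function definition itself. A minimal sketch, with invented schema and table names (payroll, employees):
CREATE FUNCTION lookup_salary(emp_id integer) RETURNS numeric AS $$
  SELECT salary FROM employees WHERE id = $1;
$$ LANGUAGE sql
  SECURITY DEFINER
  -- Pin the path so the caller's search_path cannot redirect the employees lookup;
  -- pg_temp goes last so temporary objects cannot capture names either.
  SET search_path = payroll, pg_temp;
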
What might be a better solution? More modern programming languages such as Java or Python, with their "import" mechanisms where you explicitly choose what you want in which namespace, appear to have fewer problems of the sort I described. Of course there is still a CLASSPATH or a PYTHONPATH in the background, but that's really just a workaround because the underlying operating system relies on search paths and doesn't have an import mechanism itself.

I think the straightforward solution at the moment is to ignore the whole schema issue as much as you can and not let random people issue random SQL statements. Probably not completely safe, but at least it doesn't tangle you up in a giant mess.

If someone wants to defend the current system or has proposals for fixing it, please leave comments.

(picture by david.nikonvscanon CC-BY)

16 comments:

  1. I have neither desire to defend the status quo, nor (right now, anyway) proposals to fix it, but I hope I might be permitted a comment anyway to agree wholeheartedly. The recent PostgreSQL modules discussion seems to have made it pretty clear that search paths suck.

  2. The only time I actually got into one of those search path traps was when I had trouble restoring a dump that used functional indexes with functions from another schema - I steered my way around it, though it was not a pleasant experience. It had to do with pg_dump setting search paths explicitly - I cannot remember the full details.
    Anyway, I see it as a great feature of PostgreSQL to be able to set up a new connection to the same database, just as a different user, and have a set of tables/functions in the schema named after the user override those from public when using unqualified identifiers, simply by putting "$user" at the beginning of the search path. In our case this was a most elegant solution to our problem, and I wouldn't want to miss this feature in my favourite RDBMS. I do see the problems with this, as sometimes matters become not just confusing but outright broken, and there seems to be no way to really fix that. There are, however, valid use cases, even if the concept is flawed.
    Sorry for your hassle with this, though :)

  3. Agree 99%-heartedly. Personally, I always schema-qualify everything.

    But I disagree that "you might as well make the schema part of the name". There are a lot of benefits to modularization other than omitting keystrokes! "pg_dump -n/-N" for starters.

  4. From an end-user's point of view, and one that divides his time between programming in Oracle and Postgres, I've got to say that I love Postgres' search path method. It is SO much better than the Oracle alternative of creating synonyms.

  5. I also had to struggle a bit with search paths, but they allow for a lot of control. They make it possible to have the appearance of one database while keeping things in different schemas. Yes, you can override things with the search path, but that's a feature.

    PostgreSQL's search path is exactly like PATH, where you can do the same thing, so I would like to keep the current behaviour.

  6. Interesting article. It reminds me of all the pitfalls I luckily have never fallen into (both with PostgreSQL and shell scripting). Maybe I should add "yet", but I've been at it a long time and have managed to steer clear most of the time.

    The really interesting thing, though, is the idea of adopting an import mechanism to replace search paths. Do you have a more precise idea of how this could work in PostgreSQL (or bash)?

  7. While I agree with some of your concerns, I see the flexibility of the current search path behavior as one of PostgreSQL's advantages.

    Every flexible system allows you to damage something; it's a matter of discipline, I think. If you are going to overload operators and are relying on a specific behavior by explicitly naming pg_catalog (or other system schemas) in your search_path, you are already aware of what you are doing (if not, what's the point?).

    However, someone might expect psql backslash commands to be stable; maybe it's time to introduce separate name resolution for SQL and backslash commands in our favorite command line tool?

  8. Hi Peter,

    being able to tweak PATH in Unix is a positive feature. If a user-set PATH breaks anything, it's up to the user to clean up the mess that was created using his user id. Setuid programs like sudo make sure to reset PATH to something "safe" before changing users. Non-setuid programs/scripts messing with PATH are broken.

    In PostgreSQL, the issue is worse because there is the "public" schema. (There's no analogue for that in Unix; putting "." in your PATH comes close, but that's neither a default nor a good idea.) Luckily, pg_catalog is implicitly prepended to the search_path, so the worst surprises won't happen.

    As in Unix, unless you are using SECURITY DEFINER functions, you get to keep all the trouble you are asking for. That's life :)

    What about REVOKE CREATE ON SCHEMA public FROM PUBLIC for the default installation? That way, only superusers could inject evil objects there. As with many security problems, this one is also caused by a bad/unfortunate default setting.

  9. And people want to keep schemas and search_path, and bring extensions into the game, with the option, as a user, of choosing in which schema to install any given extension.

    As an extension author, I have yet to find a way to write predictable code without knowing in what schema it'll get installed (you can't qualify your internal extension function calls, or your own tables/operators/whatever).

    Oh and this has to support any procedural language too, for example an extension might contain plperl functions, and the explicit calls are then hidden in perl code that I don't want to parse even for basic search and replace.

    If you have another idea than something like @@schema@@ or %%schema%%, I'm listening :)

    Bottom line: seeking help on the interaction of extensions, schemas, and search_path. Really. It's a big mess...

  10. One more thing: it would be nice if search paths were local to a function, i.e. if you set them, they get restored when you return. Or at least to have an option to do something like this.

  11. To answer my own question, you can set the path explicitly when you create the function, so with:

    CREATE FUNCTION check_password(uname TEXT, pass TEXT)
    RETURNS BOOLEAN AS $$
    ...
    $$ LANGUAGE plpgsql
    SET search_path = admin, pg_temp;

    Brilliant!

  12. Berend, it would be brilliant, but it is not that easy. It should be set automatically, something like:

    SET search_path = fun_schema || ',' || search_path

    Changes to the search path should be reverted at the end of the function.

    The other thing is that one shouldn't have objects in fun_schemas with the same names as objects in other schemas.

    So we're back at the starting point: objects in all schemas should have different names.

    The best way would be to implement packages for storing functions, and temp tables like session variables in those packages.

    Greetings!

  13. @tomasz, I'm not sure I understood you correctly, but any SET option on a CREATE FUNCTION statement is applied at function entry and restored to its original value when leaving the function. However, this is only available since 8.3.

  14. I must unfortunately disagree. The current way search_path works is extremely useful for various table partitioning strategies. For example, you can distribute data for different companies in their own private schema.

    By doing this and also manipulating the search_path, you only have to maintain one copy of your functions, which can access any company's data... but only one company at a time. No messy dynamic SQL or dealing with the incredibly awkward and limiting built-in table partitioning feature.

    I submit that for search_path to change or be restricted, table partitioning has to be sufficiently advanced so that it can displace it in normal usage. I think we are a long way from that...TP is mostly useless except for a very narrow range of applications.

  15. I like search paths and think the feature should stay; however, it should be changed to work locally rather than globally, and should be treated as a data-definition operation. That is, a search path should be like a macro declaration, which basically just has an effect at "compile"/"data-definition" time. By this I mean: by default, at the global level, the search path is empty, meaning nothing can be found except by fully qualified reference. Then, optionally, within each local context, e.g. per source file or per schema or per schema object definition, a search path can be declared which only affects entity references from within that context.

    This way, for example, if 2 users each own their own schema and declare their own search paths, it works like a macro: other objects declared in the same schema that refer to things in an unqualified manner behind the scenes become fully qualified as defined by that schema's search path. External users of the schema objects don't have to set search paths in order for the internals of the schema objects they use to work correctly; they would only set their own search paths so that their first hop works.

    So if user A defines a routine foo() that invokes bar(), the definer of foo() is the only one who has a say in what bar() resolves to, apart from the owners of the schemas in the search path user A set for their own schema; and ideally they'd keep that as narrow as possible, typically just the same schema, so that only local stuff resolves.

  16. I have to admit, as a business person considering moving my dependencies to Postgres, hearing these discussions gives me the 'willies'. First because it's got the open source 'design by committee' process interwoven in it like a sticky ball of worms in a bait can. And then because it reveals 'the most advanced open source database' as 'half-baked' in its primary sense of 'not done'. I got here because I tried to specify 'use database' in a Postgres query and then was stunned to hear Google say, 'you have to create a new connection'.

    The meaning of 'flexible' is completely dependent on context: if it means you have to grasp the entirety of all lower-level concepts, syntaxes, and techniques to utilize the 'flexibility aspect', flexibility is simply granularity in a prom dress. What these discussions mean is that usability is simply not a concern. And that consequence appears to be the result of the mature development community's inevitable attraction to the existing syntax, based primarily on familiarity (dependency).

    The bottom line is certainly that the 'postgresness' of Postgres is in the loose and amorphous body of code that is exposed for modification and review underneath each of the syntax elements of the ANSI SQL standard. The point of advancing a technology is to wrap up that messy stuff in neat, meaningful, and syntactically composable features that increase usability, obviousness, and business value propositions. To me, Robert Haas of EDB is the only person who has said anything that makes sense: https://www.youtube.com/watch?v=zsF1vfHBMBI.
