Thursday, June 19, 2008

Schema design and ID fields

I'm sure everyone has created something like this once:
CREATE TABLE a (id int PRIMARY KEY, value text);

CREATE TABLE b (id int PRIMARY KEY, data text, a_id int REFERENCES a (id));
CREATE TABLE c (id int PRIMARY KEY, a_id int REFERENCES a (id), moredata text);

They all have id columns, but they are all different sets of IDs. Moreover, the same set of IDs is sometimes knows as "id", sometimes as "a_id". Some time ago I come to the conclusion that this naming scheme is bad. It makes sense locally within the table, because a.a_id would be redundant. But I think it has advantages to be redundant here anyway.

The first advantage is that you can use this join syntax:
a JOIN b USING (a_id) JOIN c USING (a_id)

If you build denormalized intermediate tables (materialized views) of this type
CREATE TABLE a_b AS SELECT * FROM a, b, c WHERE = b.a_id AND = c.a_id
it will fail because the join will have multiple columns named a_id. Unless using the USING join syntax, it won't know they are actually the same. So you would have to write out the column list instead of the asterisk in the select list.

Also, once your SQL queries become more complicated, say using table aliases and subselects, it is easy to lose track of what means in a particular context.

For all of these reasons, the general rule arises that a column name should be globally unique within a database schema. This should actually be checkable automatically if you allow duplicates only for columns connected by foreign key constraints.

There is another idea you can play with while developing your database schema. You start by creating (conceptually) one big table containing all the columns that are needed to store your data, which obviously requires globally unique column names. Then you use the decomposition algorithm to normalize this table down to the individual tables that you actually implement. Some might find this preferrable to converting ER diagrams.

Friday, June 13, 2008

Debian Release Goals

I have been tracking the Debian release goals progress for a while now, through bug-squashing parties and using various iterations of wiki pages. It's time to write down some thoughts.

First of all, I think the concept of release goals is a fantastic idea. It may be one of the most important conceptual moves in Debian of late. It allows developers to implement distribution-wide changes without having to negotiate with every package maintainer individually and without having to go through the vicious no-change-without-policy/no-policy-without-established-practice maze. It moves the focus back on operating system development instead of package hoarding.

In the future, I think we should come up with some "sexier" release goal descriptions. If you look over the current list, many of these tasks are implementation-level, without any obvious benefit to users: double-build support? no unmet recommends? Or they have incomprehensible descriptions. The pretty much only release goal that has a real user benefit, namely the piuparts cleanness, ensuring clean distribution upgrades, is both the most cryptic and actually the most neglected one in terms of development attention. Don't get me wrong, all these goals are very valid. For better or worse, the press has started to pick up these release goals to put in their various Debian release candidate timeline pre-announcement articles, and next time around we should have something cool in there. If armel support or KDE 4 integration had been release goals, the marketing effect would be better.
The next thing is tracking bugs. If you have some time on your hands, how do you find a release goal bug that you can help with? I had worked up this wiki page to help myself, but it is still too weird. The fact that exists shows that the BTS is lacking features or is too hard to use or both. I'm thinking, we should regularly copy the whole BTS into an SQL database and make queries from there. Think aggregation, grouping, full-text search! I'm not familiar with the BTS internals or the LDAP gateway, but I know my SQL. So if anyone knows the other end of this deal, I'd love to chat about setting something like this up.

So then, what happens to the release goals after the release? Has this been thought through?

Obviously, some release goals become technically self-enforcing. Python 2.5 and GCC 4.3 are now the default, and anything that doesn't work with these will have permanent trouble in the future.

Release goals that implement a distribution-wide feature, such as the LSB-based init system enhancements, will really need to become part of policy after the release. Otherwise they would revert to being normal bugs without any NMU powers and the other special attention given to release goals. So over time, the support for the former release goals across packages would again decline. To keep it up, you'd either have to repeat the release goal indefinitely, which would be silly, especially when there are no problems right now, or you'd have to make it a part of policy to enforce it in the future. Essentially, you'd have to consider most release goals trial policy changes, which become permanent if successful.

Near-examples of the above are the eternal release goals of IPv6 support and LFS support. We really have these under control quite well now -- although I still wouldn't dare to call Debian or any general-purpose Linux distribution fully IPv6-ready -- but nowadays these problems are almost all upstream code problems that Debian can't really fix very well on the packaging side. So under the current approach these release goals will exist forever, or worse, will be dropped from release goal status.

And then of course, some release goals fail to be reached. These should reapply for the next release, if interest persists. The dash goal comes to mind.

By the way, we had 13 release goals. 3 are completely done, 5 are pretty much done (less than 10 bugs), 1 more looks like it could get pretty much done before the release, and 4 will most likely fail to be accomplished.