Saturday, May 23, 2009

PGCon 2009: Fourth Day - The File System Strikes Back

The leading topics among this year's PGCon talks were evidently the questions and answers around database version control, testing, and deployment methodology, as I mentioned yesterday, as well as "queueing". Note, though, that there are actually two separate uses of the word queueing floating around: one related to message brokers, the other to queueing theory for predicting database performance. Anyway, if you are queueing, you are doing something right this year. In general, I am amazed every year how many participants appear to have the same set of issues, and then a completely different same set of new issues the next year. So we are either all completely off track or mostly on the same right track.

In the lightning talks I announced the availability of postgresqlfs, a small project of mine that I theorized about many years ago and which I finally managed to implement within basically two days. If you liked PL/sh, this is the new deal for you. In fact, postgresqlfs allows you to execute your PL/sh functions directly from the shell, which is what you probably should have done in the first place. ;-)

Friday, May 22, 2009

PGCon 2009: Third Day - A New Hope

So this is going to be the new world order. Check your database into a version control system. No, not your .sql files, your database! Check it out, make changes, check it in. Wait, before you check it in, run your test suite. No, not your application test suite, your database test suite! Kind of how you develop your other code, right? Right? Together, Post Facto and pgTAP, and the spirit they represent, might be the most sensible things since the invention of the file system. And there appear to be one or two or three more talks on the program about comparing and consolidating and de-messing your databases, so that appears to be a theme this year. It's about time we organized this. Thanks, guys.

Thursday, May 21, 2009

PGCon 2009: Second Day

The developer meeting turned out to be very useful, I thought. We decided to divide the PostgreSQL community into two groups for the next release cycle: one group works on hot standby, one group on synchronous replication. Everyone, please pick a camp and help out. These features are arguably the top adoption issues for PostgreSQL now, and we don't have enough people working on them.

My body clock is still out of whack. I wake up at 5 in the morning. Seems to be a common problem among Europeans, I gather. A good time to hack. Slony-I versions 1.2.16 and 2.0.2 are now uploaded to Debian. There you go.

Wednesday, May 20, 2009

PGCon 2009: First Day

So we're back in Ottawa once more. The trip has been getting smoother over the years. Back in the residence tower overlooking the city. Met some friends and colleagues at the Royal Oak. Noted that they sell Coors under Canadian beers. Baseball on TV. Mmh ... baseball ...

Before everyone asks: I'm not running the Ottawa Marathon or anything else this year. I just ran a half-marathon in Helsinki last week, and that is enough for this month.

The developer meeting is next. I am ready for six and a half hours of Git and anti-Git bashing. ;-)

Monday, May 18, 2009

Regression test code coverage reports

I have been collecting monthly PostgreSQL regression test code coverage reports at <http://developer.postgresql.org/~petere/coverage/>. So if you are wondering what this thing is but haven't had the courage to try it out yourself, there is your chance. (Hmm, buildfarm integration could be nice, someday.)

We have had a line coverage rate of about 66% steadily for a few months now (well, it's feature freeze). The lcov tool labels that as "green" (=good). The new version of lcov, which I have in use as of the April report, also reports function coverage, where we have about 73%, which lcov labels as "red" (=bad).

For the next release cycle, I have two goals in this area: First, expand the test coverage reporting to the entire source tree, not only the backend. And second, improve the test coverage of various neglected areas. There is reduced coverage, for example, in the areas of non-btree indexes, vacuuming, recovery, GEQO; and once we analyze other parts of the source tree, we will probably find gaping holes there.

Wednesday, May 13, 2009

The Big Shots

As the occasional thinker about open-source development practices, communities, and issues, I have been wondering for a while: What are the largest open-source projects? What projects have the most code, the most users, and the most issues to deal with, and how do they cope?

The Debian archive should provide some insights into the first one or two questions, as it contains a very large portion of all available and relevant open-source software and exposes it in a fairly standard form. In the old days one might have gotten out grep-dctrl to create some puzzling statistics, but nowadays this information is actually available in an SQL database: the Ultimate Debian Database (UDD). (And it's in PostgreSQL. And it comes with a postgresql_autodoc-generated schema documentation. Excellent.)

So here is a first question. Well, the zeroth question would have been, which source packages have the largest unpacked orig tarball, but that information doesn't seem to be available, either via UDD or via apt. So the first question anyway is, which source packages produce the largest installation size across all their binary packages:
udd=> SELECT source, sum(installed_size)/1024 AS mib
      FROM packages
      WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
        AND architecture IN ('all', 'i386') AND section <> 'debug'
      GROUP BY source, version
      ORDER BY mib DESC LIMIT 30;
     source       | mib
------------------+------
openoffice.org    | 1797
kde-l10n          |  648
gcj-4.4           |  544
vtk               |  465
linux-2.6         |  404
openclipart       |  353
vegastrike-data   |  311
ghc6              |  308
gclcvs            |  303
wesnoth           |  300
fpc               |  269
axiom             |  256
webkit            |  255
gcc-snapshot      |  255
lazarus           |  241
kdebase-workspace |  226
plt-scheme        |  221
torcs-data-tracks |  219
scilab            |  213
openscenegraph    |  211
eclipse           |  210
sagemath          |  201
insighttoolkit    |  198
acl2              |  195
kdebindings       |  181
atlas             |  165
gcl               |  163
trilinos          |  153
paraview          |  153
asterisk          |  144
(30 rows)
This produces a few well-known packages, but also a number of obscure ones. If you look closer, many of them appear to be themed around scientific, numerical, visualization, Scheme, Lisp, that sort of thing. Hmm.

Here is another idea. Take a package's installation footprint and multiply it by its popularity contest installation count. So you get some kind of maintenance effort score, either because the package is large or because you have a lot of users or both.
SELECT rank() OVER (ORDER BY sum(installed_size::numeric * insts) DESC), source, sum(installed_size::numeric * insts) AS score
FROM packages JOIN popcon USING (package)
WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
  AND architecture IN ('all', 'i386')
GROUP BY source, version
ORDER BY score DESC LIMIT 30;
rank |           source            |    score
-----+-----------------------------+-------------
   1 | openoffice.org              | 12638492332
   2 | mysql-dfsg-5.0              |  3411344560
   3 | eglibc                      |  3371485240
   4 | perl                        |  3019183024
   5 | evolution                   |  2669948000
   6 | samba                       |  2308923872
   7 | mesa                        |  1853902860
   8 | texlive-base                |  1684245516
   9 | gcj-4.3                     |  1610495484
  10 | foomatic-db-engine          |  1608178104
  11 | foomatic-db                 |  1423947704
  12 | inkscape                    |  1413910080
  13 | qt4-x11                     |  1258220636
  14 | gcc-4.3                     |  1248741312
  15 | kdelibs                     |  1021058256
  16 | gnome-applets               |   998434136
  17 | xulrunner                   |   958232688
  18 | coreutils                   |   954766896
  19 | openssl                     |   877067672
  20 | ncurses                     |   827679424
  21 | python2.5                   |   815826384
  22 | aptitude                    |   808161380
  23 | gimp                        |   786015124
  24 | gnome-utils                 |   781756328
  25 | nautilus                    |   774319690
  26 | openoffice.org-dictionaries |   761075576
  27 | eclipse                     |   756072380
  28 | dpkg                        |   736626200
  29 | openclipart                 |   731244240
  30 | wine                        |   707967500
(30 rows)
(Yeah, they run this thing on PostgreSQL 8.4 beta 1.)

I noticed linux-2.6 is suspiciously absent because of a low popcon score (?!?).

I don't want to dump the entire database into this blog post, but if you try this yourself you can look at about the first 200 to 300 places to find reasonably large and complex projects before it gets a bit more obscure. A few highlights:
  51 | gnupg                       |   455660464
  59 | php5                        |   386417572
  60 | mutt                        |   381148176
  83 | icu                         |   258602756
  84 | xorg-server                 |   255186332
 101 | exim4                       |   224857700
 107 | openssh                     |   215792828
 113 | tar                         |   201520400
 114 | postgresql-8.3              |   196844584
 115 | libx11                      |   195856564
 116 | ruby1.8                     |   194681656
 272 | emacs22                     |    62047476
This is obviously still biased in a lot of ways, but it does show the major projects.
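
For anyone who wants to play with the scoring idea without database access, here is a minimal Python sketch of the same computation: installation footprint times popcon installation count, summed over the binary packages of each source package. The package names and numbers are invented for illustration and are not UDD data, and maintenance_scores is a hypothetical helper, not part of any Debian tool.

```python
from collections import defaultdict

# Hypothetical sample rows; the numbers are made up for illustration
# and are not taken from UDD.
# (binary package, source package, installed size in KiB)
packages = [
    ("big-suite-core", "big-suite", 200_000),
    ("big-suite-l10n", "big-suite", 50_000),
    ("tiny-tool", "tiny-tool", 300),
    ("niche-sim", "niche-sim", 400_000),
]

# binary package -> popularity contest installation count (also made up)
popcon_insts = {
    "big-suite-core": 40_000,
    "big-suite-l10n": 10_000,
    "tiny-tool": 80_000,
    "niche-sim": 20,
}


def maintenance_scores(packages, popcon_insts):
    """Sum installed_size * insts over the binary packages of each source."""
    scores = defaultdict(int)
    for binary, source, installed_size in packages:
        # Binary packages without a popcon entry are skipped,
        # which mirrors what an inner join would do.
        if binary in popcon_insts:
            scores[source] += installed_size * popcon_insts[binary]
    # Highest score first; ties are broken by name for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = maintenance_scores(packages, popcon_insts)
for position, (source, score) in enumerate(ranking, start=1):
    print(f"{position:>4} | {source:<12} | {score:>12}")
```

In this made-up data set the largest package by footprint (niche-sim) ends up last because hardly anyone installs it, which is the effect the score is meant to capture. Note that the sketch simply numbers the sorted rows, so unlike rank() it would not give tied scores the same position.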

The UDD is also an interesting use case that shows how you can deploy a PostgreSQL database as a semi-public service with direct access. A great tool, and a great tool to build other great tools on top of.

European Election

Last week I received my postal voting papers for the European election. As a postal voter you have some time to read the ballot a bit more closely. That way you occasionally run into old acquaintances. For example, in position 6 on the CDU list I see

Sabine Verheyen, housewife, Aachen

That is, I strongly suspect, the same Ms. Verheyen who a few years ago, when I still lived in Aachen, lost the mayoral election to

Jürgen Linden, Lord Mayor, Aachen

A tip from the ordinary electorate: with all due respect to housewives, please put something else on the ballot. It looks better. If need be, do what number 2 on the DIE LINKE list did: "Angestellte" (salaried employee). See, it can be done.

The ballot here where I live carries 31 nominations in total. Freedom of association is a great thing, but people could also coordinate once in a while. Perhaps

10. Volksabstimmung (Referendum), and
26. Für Volksentscheide (For Referendums)

could join forces next time?

And perhaps

18. 50plus
22. Die Grauen (The Grays)
30. Rentnerinnen und Rentner Partei (Pensioners' Party)
31. Rentner-Partei-Deutschland (Pensioners' Party of Germany)

will also find some common ground next time?

The real kicker, though, is the occupations of the candidates on the list of the Piratenpartei Deutschland: graduate business mathematician, graduate physicist, self-employed IT entrepreneur, computer scientist, managing director, student, student, programmer, web developer, consultant. Got it. ;-)

By the way, the Wahl-O-Mat is back this time. Last time it was apparently banned by court order because it had not included all parties. This time almost all 31 are in it. Great ...