Wednesday, May 13, 2009

The Big Shots

As an occasional thinker about open-source development practices, communities, and issues, I have been wondering for a while: What are the largest open-source projects? Which projects have the most code, the most users, and the most issues to deal with, and how do they cope?

The Debian archive should provide some insights into the first one or two questions, as it contains a very large portion of all available and relevant open-source software and exposes it in a fairly standard form. In the old days one might have gotten out grep-dctrl to create some puzzling statistics, but nowadays this information is actually available in an SQL database: the Ultimate Debian Database (UDD). (And it's in PostgreSQL. And it comes with postgresql_autodoc-generated schema documentation. Excellent.)

So here is a first question. Well, the zeroth question would have been, which source packages have the largest unpacked orig tarball, but that information doesn't seem to be available, either via UDD or via apt. So the first question anyway is, which source packages produce the largest installation size across all their binary packages:
udd=> SELECT source, sum(installed_size)/1024 AS mib
      FROM packages
      WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
        AND architecture IN ('all', 'i386') AND section <> 'debug'
      GROUP BY source, version
      ORDER BY mib DESC
      LIMIT 30;
      source       | mib
-------------------+------
 openoffice.org    | 1797
 kde-l10n          |  648
 gcj-4.4           |  544
 vtk               |  465
 linux-2.6         |  404
 openclipart       |  353
 vegastrike-data   |  311
 ghc6              |  308
 gclcvs            |  303
 wesnoth           |  300
 fpc               |  269
 axiom             |  256
 webkit            |  255
 gcc-snapshot      |  255
 lazarus           |  241
 kdebase-workspace |  226
 plt-scheme        |  221
 torcs-data-tracks |  219
 scilab            |  213
 openscenegraph    |  211
 eclipse           |  210
 sagemath          |  201
 insighttoolkit    |  198
 acl2              |  195
 kdebindings       |  181
 atlas             |  165
 gcl               |  163
 trilinos          |  153
 paraview          |  153
 asterisk          |  144
(30 rows)
This produces a few well-known packages, but also a number of obscure ones. If you look closer, many of them appear to be themed around scientific, numerical, visualization, Scheme, Lisp, that sort of thing. Hmm.
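For anyone who wants to play with the aggregation offline, here is a minimal Python sketch of what the query above does, run on made-up data. It assumes, as the division by 1024 suggests, that installed_size is recorded in KiB and stored as an integer. The package names, the numbers, and the helper `largest_sources` are all hypothetical, and unlike the query it groups by source only, not by source and version.

```python
from collections import defaultdict

# Hypothetical rows: (source package, binary package, installed size in KiB).
ROWS = [
    ("bigsuite", "bigsuite-core", 600_000),
    ("bigsuite", "bigsuite-data", 424_000),
    ("smalltool", "smalltool", 2_048),
    ("midlib", "libmid1", 150_000),
    ("midlib", "libmid-dev", 55_800),
]


def largest_sources(rows, limit=30):
    """Sum installed sizes per source and return (source, MiB) pairs, largest first."""
    totals = defaultdict(int)
    for source, _package, size_kib in rows:
        totals[source] += size_kib
    # Integer division mirrors sum(installed_size)/1024 on an integer column,
    # so fractions of a MiB are truncated.
    in_mib = [(source, kib // 1024) for source, kib in totals.items()]
    in_mib.sort(key=lambda item: item[1], reverse=True)
    return in_mib[:limit]


print(largest_sources(ROWS))
```

With this toy data the two bigsuite binary packages are folded into a single 1000 MiB entry, which is the same effect that puts multi-package sources such as openoffice.org at the top of the real list.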

Here is another idea. Take a package's installation footprint and multiply it by its popularity contest (popcon) installation count. That gives some kind of maintenance effort score, which is high because the package is large, because it has a lot of users, or both.
SELECT rank() OVER (ORDER BY sum(installed_size::numeric * insts) DESC),
       source,
       sum(installed_size::numeric * insts) AS score
FROM packages JOIN popcon USING (package)
WHERE distribution = 'debian' AND release = 'sid' AND component = 'main'
  AND architecture IN ('all', 'i386')
GROUP BY source, version
ORDER BY score DESC
LIMIT 30;
 rank |           source            |    score
------+-----------------------------+-------------
    1 | openoffice.org              | 12638492332
    2 | mysql-dfsg-5.0              |  3411344560
    3 | eglibc                      |  3371485240
    4 | perl                        |  3019183024
    5 | evolution                   |  2669948000
    6 | samba                       |  2308923872
    7 | mesa                        |  1853902860
    8 | texlive-base                |  1684245516
    9 | gcj-4.3                     |  1610495484
   10 | foomatic-db-engine          |  1608178104
   11 | foomatic-db                 |  1423947704
   12 | inkscape                    |  1413910080
   13 | qt4-x11                     |  1258220636
   14 | gcc-4.3                     |  1248741312
   15 | kdelibs                     |  1021058256
   16 | gnome-applets               |   998434136
   17 | xulrunner                   |   958232688
   18 | coreutils                   |   954766896
   19 | openssl                     |   877067672
   20 | ncurses                     |   827679424
   21 | python2.5                   |   815826384
   22 | aptitude                    |   808161380
   23 | gimp                        |   786015124
   24 | gnome-utils                 |   781756328
   25 | nautilus                    |   774319690
   26 | openoffice.org-dictionaries |   761075576
   27 | eclipse                     |   756072380
   28 | dpkg                        |   736626200
   29 | openclipart                 |   731244240
   30 | wine                        |   707967500
(30 rows)
(Yeah, they run this thing on PostgreSQL 8.4 beta 1.)
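The score itself is easy to reproduce by hand. The following is only a rough Python sketch of the size-times-installations idea on made-up data, not the UDD schema: the dictionaries, the package names, and the function `rank_by_score` are hypothetical. It does mimic one property of the query, namely that JOIN ... USING (package) is an inner join, so a binary package without a popcon row contributes nothing.

```python
from collections import defaultdict

# Hypothetical data: binary package -> (source package, installed size in KiB).
PACKAGES = {
    "bigsuite-core": ("bigsuite", 600_000),
    "bigsuite-data": ("bigsuite", 424_000),
    "libtiny1": ("tinylib", 500),
    "nichetool": ("nichetool", 300_000),
}

# Hypothetical popcon data: binary package -> reported installations.
# "nichetool" is deliberately missing.
POPCON_INSTS = {
    "bigsuite-core": 1_000,
    "bigsuite-data": 900,
    "libtiny1": 80_000,
}


def rank_by_score(packages, insts):
    """Return (position, source, score) tuples, highest score first."""
    totals = defaultdict(int)
    for package, (source, size_kib) in packages.items():
        if package in insts:  # inner-join semantics: skip packages without popcon data
            totals[source] += size_kib * insts[package]
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # enumerate numbers the rows 1..n; unlike SQL rank(), ties do not share a position.
    return [(pos, source, score) for pos, (source, score) in enumerate(ordered, start=1)]


for row in rank_by_score(PACKAGES, POPCON_INSTS):
    print(row)
```

In this toy run the tiny but widely installed tinylib makes the list, while the large nichetool drops out entirely for lack of popcon data, which illustrates how much the ranking depends on the installation counts.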

I noticed linux-2.6 is suspiciously absent because of a low popcon score (?!?).

I don't want to dump the entire database into this blog post, but if you try this yourself you can look at about the first 200 to 300 places to find reasonably large and complex projects before it gets a bit more obscure. A few highlights:
   51 | gnupg                       |   455660464
   59 | php5                        |   386417572
   60 | mutt                        |   381148176
   83 | icu                         |   258602756
   84 | xorg-server                 |   255186332
  101 | exim4                       |   224857700
  107 | openssh                     |   215792828
  113 | tar                         |   201520400
  114 | postgresql-8.3              |   196844584
  115 | libx11                      |   195856564
  116 | ruby1.8                     |   194681656
  272 | emacs22                     |    62047476
This is obviously still biased in a lot of ways, but it does show the major projects.

The UDD is also an interesting use case that shows how you can deploy a PostgreSQL database as a semi-public service with direct access. A great tool, and a great tool to build other great tools on top of.

2 comments:

  1. Another interesting count would be the total size of the package and its dependencies (with Recommended installed, minus the size of the Essential packages).

  2. Pity there are so few Java projects in Debian. I bet there would be quite a couple of them that beat Eclipse in size.
