Physics Software Rant
Caution: This page has been resurrected (many thanks to the Wayback Machine) from 2008. I have not yet had time to go through and update links or my biographical information, nor to check whether the details of specific cases presented in this article are still accurate. It would also be nice to add additional sections based on cases I have come across more recently, such as "Don't be intentionally hostile to third-party maintainers and distributors of your software, lest they give up." Nevertheless, it still seems worthwhile to make available again.
As a graduate student and later post-doc in physics, developer of some small Free Software projects, maintainer of the Debian packages of CERNLIB, and system administrator of a data analysis cluster, I feel well-qualified to state that most Unix-based physics software produced by research organizations fails to meet even the simplest expectations one might have for quality software. Let me clarify that I am not referring to the actual code, which is generally quite good and is a testament to the skills and intelligence of the authors. No, I am talking about how the process of compiling and installing a well-reputed piece of physics software is fraught with confusion, hassle, and worse. There is absolutely no excuse for it to be this way.
In this rant I will point out good software packaging practices, and illustrate how well-known pieces of physics software violate them. It is my hope, although not my expectation, that the authors of the software named below will be shamed into fixing it.
First and most importantly — Choose a License!
In order for people to download your source code and use it, they need your permission. This is not a difficult concept; it comes from the basics of copyright law. If your software does not have a license (which provides this permission), no one can legally use it and you may as well not even have published it. I couldn't even find anything pretending to be a license for Pythia or most other Monte Carlo generators.
Some popular Free Software [1] licenses, for reference, include the GNU General Public License, the GNU Lesser (or Library) General Public License, the BSD license, and the Apache License. A variant of the last may be desirable for physics projects which are worried about third-party modifications reflecting badly on them; people who redistribute modified versions of Apache are required to get written permission from the authors in order to call their derived product by the Apache name. For those who want to write a Free license for their project, GNU's annotated list of licenses and the Debian Free Software Guidelines (complete with FAQ) should provide a lot of food for thought.
Examples of licenses that are not Free can be found in the EULA (End User License Agreement) of most mainstream commercial software. Licenses such as that of EGSnrc and the original license of ROOT do not qualify as Free Software, either, even though the source code is available. The former prohibits use of the software for commercial purposes, and the latter prohibited distribution of modified copies. (ROOT has since moved to using the LGPL; many thanks to the authors for this license change!) Be aware that if you want your software to be widely used, distributed, and improved by third parties, and if you intend to make the software available at no cost anyway, having a Free license is an excellent way to promote these goals.
In addition to having a license, you must avoid violating the licenses of anyone else's software. Violating them may result in your being sued; in some countries it may even be a criminal offense. Some examples in the Hall of Shame are the inclusion of FLUKA and Pythia/Jetset code in the nominally GPL-licensed CERNLIB. Watch out as well that you aren't accidentally including restrictive licenses from long ago in parts of your code, such as Matrix.h in CLHEP.
Don't forget to license your documentation properly too! Because of the lack of said license, I cannot include all of the detailed CERNLIB documentation in the Debian packages of CERNLIB, much as I would like to; I can use only the small amount that can be found in the explicitly licensed source tree. In choosing a license for your documentation, you should be aware that, although the GNU Free Documentation License is popular, there is some controversy about whether it is truly Free. Many members of the Debian Project have suggested that the best course of action is to provide your programs and their documentation under the same license. This makes it easy for downstream packagers and bug fixers to legally move code and comments back and forth between programs and docs.
Finally, be aware that you (as an individual) and your employer may have very different goals with respect to licensing your project. Some contracts even have the employer claiming ownership over your code written during off-hours if it is in any way related to your job — check on this. Even if you don't have such an onerous contract, if the project is being done for work-related purposes, you should make sure that your employer's views on licensing are aligned with yours.
Put the Software into One File
There is no need to make anyone download ten different tar.gz files in order to have everything needed to compile your software. Just put everything into ONE tar.gz file! It isn't as though downloading ten 5-MB files takes any less time than one 50-MB file. But it is a lot more annoying.
Don't take this advice to extremes, however. You should NOT include readily available third-party packages in your tarball, as ROOT does with Freetype and libAfterImage, especially if your software and the third-party package(s) have different licenses. This is just asking for trouble.
On a related note, it is intensely irritating to find that people have created a tar.gz file, like this one, that spews its contents into the current directory without creating a subdirectory to hold them. The naming convention for that subdirectory is softwarename-x.y.z, where x.y.z is the version of the software, not just softwarename; the latter would overwrite any old copies of your software that a user may have unpacked in the same directory.
Make Your Configure Script Do the Right Thing Automatically
These days, we are beyond the dark ages of "Edit this Makefile so that this software compiles on your computer." Thanks to the GNU project, we have a wonderful program called "Autoconf" that creates a configure script for you that will automatically determine what sort of system your program is being compiled on. All you need to do is write some macros testing for features needed by your software to compile. Until recently, ROOT users running Linux on an iBook were somehow supposed to know that they needed the "linuxppcegcs" argument to ./configure. (Huh? The egcs project doesn't even exist separately from gcc these days!) Even if you abhor Autoconf, your configure script must NOT require arguments to specify the architecture unless the user wants to cross-compile. In the worst case, put a case statement in the configure script and make use of uname.
Autoconf works best when used with its sister program, Automake. This program removes much of the pain of writing Makefiles. Note that neither Autoconf nor Automake, despite being GNU software, puts any constraints on the choice of license for your software. Try them, you might like them! Your users surely will be happier, for their installation will be reduced to the simple three steps of "./configure --with-some-options ; make ; sudo make install". Compare this with the several pages of complicated compilation and installation instructions that some current packages require.
However you write a configure script, be sure to test it for various cases, such as the source, build, and install directories all being different. It turns out that (at least with Geant4 version 6.0) running its ./Configure -install bombs out if you tell it to install anywhere other than the current source directory. It also makes another big mistake: it compiles and immediately installs the code into the target directory in one step. The "make" and "make install" steps noted above are separate so that a system administrator can compile code and test it out as a user, needing to become root only for the install step.
Consider Portability
Most components of CERNLIB are completely broken on modern 64-bit architectures like the Opteron or Itanium. The reason is that CERNLIB's authors were not far-sighted enough to consider the possibility of sizeof(void *) being larger than sizeof(int), and the assumption that a pointer can be stored into an int is so deeply built into the CERNLIB architecture that it is essentially impossible to fix. Since it seems likely that AMD and Intel's 64-bit chips will eventually replace 32-bit chips in the PC marketplace, this means that CERNLIB is condemned to gradual obsolescence. (Yes, there exists a patch that gets CERNLIB to work, mostly, on AMD64. The patch is basically a gigantic hack that takes advantage of the fact that although AMD64 pointers are all 64-bit, function and static data pointers have values less than 2^32. It does NOT work on Itanium or Alpha architectures, and only works on AMD64 for statically linked CERNLIB apps.) Do not store pointers into ints!!!
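To illustrate, here is a minimal C sketch (not taken from CERNLIB) of the difference between stuffing a pointer into an int and using the C99 intptr_t type, which is guaranteed to be wide enough to round-trip an object pointer:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 3.14;

        /* WRONG on LP64 platforms: int is 32 bits but a pointer is 64 bits,
         * so the upper half of the address would be silently thrown away:
         *
         *     int bad = (int)&x;
         */

        /* Portable: intptr_t (from <stdint.h>, C99) is wide enough to hold
         * an object pointer regardless of the platform's pointer size. */
        intptr_t addr = (intptr_t)&x;
        double *p = (double *)addr;

        printf("%f\n", *p);   /* prints 3.140000 on 32- and 64-bit alike */
        return 0;
    }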
Another important factor to consider is the issue of endianness. Basically, different CPUs store the bytes of multi-byte variables in memory in different orders; full details can be found in this article. This will make a difference any time you transmit data over a network (where the data format should be "big-endian"), store data into a file in binary format (where the specific endianness doesn't matter, but must be kept consistent for the file to be usable on different platforms), or retrieve data from a binary file. Thankfully, most physics applications I am aware of get this one right.
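One safe habit is to serialize multi-byte values explicitly, one byte at a time, rather than writing raw memory to disk. The following is only a sketch; the helper names and the file name are made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Write a 32-bit value in a fixed (big-endian) byte order, so the file
     * reads back identically on any CPU. */
    static void put_be32(uint32_t v, FILE *f)
    {
        unsigned char buf[4] = {
            (unsigned char)(v >> 24), (unsigned char)(v >> 16),
            (unsigned char)(v >> 8),  (unsigned char)(v)
        };
        fwrite(buf, 1, 4, f);
    }

    static uint32_t get_be32(FILE *f)
    {
        unsigned char buf[4];
        if (fread(buf, 1, 4, f) != 4)
            return 0;   /* real code would handle the error properly */
        return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
             | ((uint32_t)buf[2] <<  8) |  (uint32_t)buf[3];
    }

    int main(void)
    {
        FILE *f = fopen("events.dat", "wb+");
        if (!f) return 1;
        put_be32(123456789u, f);
        rewind(f);
        printf("%u\n", (unsigned)get_be32(f));   /* 123456789, on any platform */
        fclose(f);
        return 0;
    }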
Still another platform-dependent issue is that of alignment. Some CPUs require that various data types be aligned to memory locations that are multiples of four bytes, while some other CPUs and data types require multiples of eight. In practice, alignment issues sometimes require the compiler to add "padding" between elements of a C struct, C++ class, or FORTRAN common block, so if you aren't careful, the size of the data structure may vary between platforms. To prevent this from happening, you need to construct the data structure in a way that fulfills the most stringent possible alignment requirements, which perhaps will involve dummy members. Or, you can accept that the size will be platform-dependent, and code in a way that doesn't make assumptions about it. PAW compiled for Linux on the Motorola 680x0 platform gave me bizarre results until I built it with the -malign-int flag, forcing all integers to be aligned at multiples of four bytes; in this case, the code had more stringent alignment requirements than the CPU!
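Here is a small C sketch of the padding problem just described (the struct names are hypothetical). The first layout may have a different size on different ABIs because of compiler-inserted padding; the second pins the layout down explicitly:

    #include <stddef.h>
    #include <stdio.h>

    struct record_fragile {
        char   flag;    /* 1 byte, then compiler-inserted padding... */
        double value;   /* ...up to the double's alignment, which may be
                         * 4 or 8 bytes depending on the platform ABI */
    };

    struct record_stable {
        double value;   /* most strictly aligned member first */
        char   flag;
        char   pad[7];  /* explicit padding: same size everywhere */
    };

    int main(void)
    {
        printf("fragile: value at offset %zu, total %zu bytes\n",
               offsetof(struct record_fragile, value),
               sizeof(struct record_fragile));
        printf("stable:  total %zu bytes\n", sizeof(struct record_stable));
        return 0;
    }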
These three points only scratch the surface of portability issues, but be aware of them at least, from the very beginning of the project, when writing code. Even if you never intend to release your project publicly, there is no telling what platforms you and your colleagues may be using in the future; failing to consider portability may make it difficult or impossible to migrate to them.
Library Issues
It seems that every physics software package comes with its own program for calculating library dependencies — root-config for ROOT, cernlib for CERNLIB, liblist for Geant4... And each one works differently. (The cernlib script was so problematic that I rewrote it completely for the CERNLIB Debian packages.) Libtool and pkg-config make dependency calculation a solved problem! Libtool, as an added benefit, is meant to work together with Autoconf / Automake. Of course, Libtool and pkg-config weren't around when many pieces of physics software were written, but there is no time like the present to adapt.
Another common problem is that no one seems to know how to make a shared library. When a library is available only in static form, every time a program links against it, more disk space is wasted. Furthermore, every time a bug fix is made to a static library, you have to recompile every program depending upon it to take advantage of the bug fix. Libtool, happily, makes shared library creation much easier than it used to be. For more information on this subject, its manual is very useful, even if you aren't using Libtool. The Linux Program Library HOWTO should also prove informative.
When building shared libraries, you should learn about how soname versioning works, and what kinds of changes break library ABI compatibility (in C++, just about any change involving a class definition will do so). Your users will then be able to upgrade to newer versions of your library in a controlled manner without unexpectedly breaking their already compiled applications. (The Geant4 folks don't seem to be careful about this yet. Not only do they not use soname versioning, but even version 4.8.0 + patch01 has a different C++ ABI than the original version 4.8.0, due to changes such as this one [see lines 65-72 in the right-hand column].)
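For a concrete, if simplified, picture of what an ABI break looks like, here is a plain C sketch rather than C++ (the struct and member names are hypothetical); the same reasoning applies, and in C++ nearly any edit to a class definition has the same effect:

    #include <stdio.h>

    /* As shipped in release 1.0 of a hypothetical library's public header: */
    struct event_v1 { int id; double energy; };

    /* After an innocent-looking addition in release 1.1: */
    struct event_v2 { int id; double energy; double time; };

    int main(void)
    {
        /* The size (and array stride) of the public type has changed, so a
         * program compiled against the 1.0 header but loading the 1.1
         * library reads and writes the wrong bytes.  That is an ABI break,
         * and it calls for a new soname. */
        printf("v1: %zu bytes, v2: %zu bytes\n",
               sizeof(struct event_v1), sizeof(struct event_v2));
        return 0;
    }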
Generally the library soname versions should be changed independently of the software version number. Recent releases of CLHEP have unfortunately bumped the soname version even when the only changes since the last release have been minor configure script updates. This breaks ABI compatibility for no reason whatsoever. Admittedly, however, it is hard to separate software and soname versioning properly when building a large number of libraries from the same source, as ROOT and Geant4 do. More commentary on library versioning issues, from the point of view of people who package software professionally (albeit rather Linux-centric), is provided at the Debian Library Packaging Guide.
Don't Install Unnecessary Files
Do I really need all the stuff in geant4.6.0/hadronic_lists after Geant4 is finished compiling? Who knows? But I dare not delete it in case it's used by some obscure library. EGSnrc is even worse, with hundreds of undocumented little shell scripts that may be useful for something; then again, they may not. (This seems to have improved in EGSnrc version 4.) Your software should consist of binaries, libraries, documentation, data files, and possibly include files for development, not everything plus the kitchen sink. If having hundreds of little shell scripts is unavoidable, it's time to redesign your software.
Do NOT by default install your source code if the user is already compiling from source!
Don't Require Specific Files in Users' Home Directories
In other words, don't try to put the user in a box. There is really no excuse for making me create an unwanted directory "egsnrc" in my home directory. Yes, forcing the user to put all their EGSnrc code in that directory makes writing the EGSnrc shell scripts easier. No, it isn't acceptable. I should be able to run your code in whatever directory (where I have write permissions) that I want! End of story.
If you really need to save something to a user's directory, put it in a dotfile directly under $HOME, as is the accepted convention, so I don't have to look at it. Temporary files should go in $TMPDIR, or /tmp if that isn't defined. (But take care to create temporary files securely!) Cluttering up directories by putting little "last.kumac" and "paw.metafile" files everywhere is not at all appreciated, especially when they manage to migrate into CVS.
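A minimal C sketch of the secure-temporary-file part (the program name in the template is just a placeholder): honor $TMPDIR, fall back to /tmp, and let the POSIX mkstemp() function create the file atomically with a unique name and owner-only permissions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const char *tmpdir = getenv("TMPDIR");
        if (tmpdir == NULL || *tmpdir == '\0')
            tmpdir = "/tmp";

        char path[4096];
        snprintf(path, sizeof path, "%s/myprog-XXXXXX", tmpdir);

        int fd = mkstemp(path);   /* creates the file exclusively, mode 0600 */
        if (fd == -1) {
            perror("mkstemp");
            return 1;
        }

        /* ... write scratch data via fd ... */

        unlink(path);             /* nothing left cluttering $TMPDIR */
        close(fd);
        return 0;
    }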
On the converse side, why can't the Geant4 examples be self-contained by default? Without specifying a whole bunch of environment variables, they reference non-existent files in "../../..", and if only $G4INSTALL is defined, try to install things to system directories. Speaking of which...
Don't Depend on Environment Variables
Please don't make your users pollute their environment namespace with ten thousand environment settings like Geant4's $G4INSTALL or (even worse) EGSnrc's non-descriptive $HEN_HOUSE. Instead of depending on environment variables, your binaries should know where to find their required data and libraries by having the appropriate paths compiled in. (These paths must of course be settable at compile time, e.g. with ./configure --prefix=/opt.) Use of files in /etc, as ROOT does, might be an acceptable alternative. If you absolutely can't do either (and why not?), your binaries should at least work in the default case. By default, libraries ought to go in /usr/local/lib, data files in something like /usr/local/share/software, man pages in /usr/local/man, etc. You may find the Filesystem Hierarchy Standard useful.
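A sketch of the compiled-in-path approach in C (the macro name, program name, and directories here are made up for illustration): the build system, for example a ./configure script, passes the chosen prefix to the compiler, and the binary needs no environment variable at all to find its data:

    #include <stdio.h>

    /* The build system defines this at compile time, e.g.
     *     cc -DDATADIR='"/usr/local/share/mytool"' ...
     * so the installed binary knows where its data files live. */
    #ifndef DATADIR
    #define DATADIR "/usr/local/share/mytool"   /* sensible default */
    #endif

    int main(void)
    {
        printf("reading data files from %s\n", DATADIR);
        return 0;
    }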
Cooperate with Packaging Systems
Although not required, it would be nice if you provided RPM or deb format packages of the Linux versions of your software. This allows your software to play nicely with the programs already installed on someone's Linux workstation. Through intelligent use of dependencies, it also allows you to enforce the library requirements of your software. Even better, if you set up a repository that can be read by programs like apt-get, apt-rpm or yum, users of your software will automatically get the latest and greatest version every time you update the repository. Don't forget to announce your repository on rpmfind.net or apt-get.org.
Don't make the mistake, as Intel did with its icc compiler [I admit this example is not physics-related], of creating a bunch of RPMs, and then writing a wrapper shell script around them to install them. RPMs (and debs) don't NEED an install script, that's what rpm and dpkg are for!
Conclusion
Although I have been harsh, it was not my intent to insult anyone. I am certainly grateful for the wide variety of free physics software available on the Internet. I just wish that, after spending man-years developing their software, people would take a few extra days putting finishing touches on it to avoid problems like those discussed above. This would go a long way towards making physicists and sysadmins everywhere happy, and wasting a lot less of their time.
Copyright
In the interest of spreading the information on this page, I'm licensing it under essentially the BSD license, minus the no-op advertising clause and meaningless legal disclaimer. Please note this license does not apply to other material on my web site.
Specifically: this HTML file is © Copyright 2004, 2005, 2006, 2008 Kevin B. McCarty. Redistribution and use in any form, with or without modification, are permitted provided that the following conditions are met:
- Redistributions in HTML or plain text format must retain the above copyright notice and this list of conditions.
- Redistributions in any other form must reproduce the above copyright notice and this list of conditions in the documentation and/or other materials provided with the distribution.
Footnotes
1. Maybe I do not need to say so explicitly at this point in time, but when I refer to "Free Software" with the capital letters, I do not mean software that merely is available at no cost (free as in "free beer"). Instead I mean software whose source code is available and for which anyone can redistribute modified copies of the source code as well as binaries compiled from those modified copies—Free as in Free Speech.
2. The basic difference between the GPL and LGPL is that a proprietary software program may link against a library licensed under LGPL, but not a library under GPL. However, a proprietary program may not directly include or derive from either GPL or LGPLed code.