Thursday, November 5, 2009

Python in the Scientific World

Yesterday I attended a biweekly meeting of an informal a UC Berkeley group devoted to Python in science (Py4Science), organized by Fernando Perez. The format (in honor of my visit) was a series of 4-minute lightning talks about various projects using Python in the scientific world (at Berkeley and elsewhere) followed by an hourlong Q&A session. This meant I didn't have to do a presentation and still got to interact with the audience for an hour -- my ideal format.

I was blown away by the wide variety of Python use for scientific work. It looks like Python (with extensions like numpy) is becoming a standard tool for many sciences that need to process large amounts of data, from neuroimaging to astronomy.

Here is a list of the topics presented (though not in the order presented). All these describing Python software; I've added names and affiliations insofar I managed to get them. (Thanks to Jarrod Millman for providing me with a complete list.) Most projects are easily found by Googling for them, so I have not included hyperlinks except in some cases where the slides emphasized them. (See also the blog comments.)
  • Fernando gave an overview of the core Python software used throughout scientific computing: NumPy, Matplotlib, IPython (by Fernando), Mayavi, Sympy (about which more later), Cython, and lots more.
  • On behalf of Andrew Straw (Caltech), Fernando showed a video of an experimental setup where a firefly is tracked in real time by 8 camaras spewing 100 images per second, using Python software.
  • Nitimes, a time-series analysis tool for neuroimaging, by Ariel Rokern (UCB).
  • A comparative genomics tool by Brent Pedersen of the Freeling Lab / Plant Biology (UCB).
  • Copperhead: Data-Parallel Python, by Bryan Catanzaro (working with Armando Fox) and others.
  • Nipype: Neuroimaging analysis pipeline and interfaces in Python, by Chris Burns (http://nipy.sourceforge.net/nipype/).
  • SymPy -- a library for symbolic mathematics in Pure Python, by Ondrej Certik (runs on Google App Engine: http://live.sympy.org).
  • Enthought Python Distribution -- a Python distro with scientific batteries inluded (some proprietary, many open source), supporting Windows, Mac, Linux and Solaris. (Travis Oliphant and Eric Jones, of Enthought.)
  • PySKI, by Erin Carson (working with Armando Fox) and others -- a tool for auto-tuning computational kernels on sparse matrices.
  • Rapid classification of astronomical time-series data, by Josh Bloom, UCB Astronomy Dept. One of the many tools using Python is GroupThink, which lets random people on the web help classify galaxies (more fun than watching porn :-).
  • The Hubble Space Telescope team in Baltimore has used Python for 10 years. They showed a tool for removing noise generated by cosmic rays from photos of galaxies. The future James Webb Space Telescope will also be using Python. (Perry Greenfield and Michael Droettboom, of STSCI.)
  • A $1B commitment by the Indian government to improve education in India includes a project by Prabhu Ramachandran of the Department of Aerospace Engineering at IIT Bombay for Python in Science and Engineering Education in India (see http://fossee.in/).
  • Wim Lavrijsen (LBL) presented work on Python usage in High Energy Physics.
  • William Stein (University of Washington) presented SAGE, a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
All in all, the impression I got was of an incredible wealth of software, written and maintained by dedicated volunteers all over the scientific community.

During the Q&A session, we touched upon the usual topics, like Python 3 transition, the GIL (there was considerable interest in Antoine Pitrou's newgil work, which unfortunately I could not summarize adequately because I haven't studied it enough yet), Unladen Swallow, and the situation with distutils, setuptools and the future 'distribute' package (for which I unfortunately had to defer to the distutil-sig).

The folks maintaining NumPy have thought about Python 3 a lot, but haven't started planning the work. Like many other projects faced with the Python 3 porting task, they don't have enough people who actually know the code base well enough do embark upon such a project. They do have a plan for arriving at PEP 3118 compliance within the next 6 months.

Since NumPy is at the root of the dependency graph for much of the software packages presented here, getting NumPy ported to Python 3 is pretty important. We briefly discussed a possible way to obtain NumPy support for Python 3 sooner and with less effort: a smaller "core" of NumPy could be ported first, which would give the NumPy maintainers a manageable task, combined with the goal of selecting a smaller "core" which would give them the opportunity for a clean-up at the same time. (I presume this would mostly be a selection of subpackage to be ported, not an API-by-API cleanup of APIs; the latter would be a bad thing to do simultaneous with a big port.)

After the meeting, Fernando showed me a little about how NumPy is maintained. They have elaborate docstrings that are marked up with a (very light) variant of Sphynx, and they let the user community edit the docstrings through a structured wiki-like setup. Such changes are then presented to the developers for review, and can be incorporated into the code base with minimal effort.

An important aspect of this approach is that the users who edit the docstrings are often scientists who understand the computation being carried out in its scientific context, and who share their knowledge about the code and its background and limitations with other scientists who might be using the same code. This process, together with the facilities in IPython for quickly calling up the docstring for any object, really improves the value of the docstrings for the community. Maybe we could use something like this for the Python standard library; it might be a way that would allow non-programmers to help contribute to the Python project (one of the ideas also mentioned in the diversity discussions).