Thursday, November 5, 2009

Python in the Scientific World

Yesterday I attended a biweekly meeting of an informal UC Berkeley group devoted to Python in science (Py4Science), organized by Fernando Perez. The format (in honor of my visit) was a series of 4-minute lightning talks about various projects using Python in the scientific world (at Berkeley and elsewhere) followed by an hourlong Q&A session. This meant I didn't have to do a presentation and still got to interact with the audience for an hour -- my ideal format.

I was blown away by the wide variety of Python use for scientific work. It looks like Python (with extensions like numpy) is becoming a standard tool for many sciences that need to process large amounts of data, from neuroimaging to astronomy.

Here is a list of the topics presented (though not in the order presented). All of these describe Python software; I've added names and affiliations insofar as I managed to get them. (Thanks to Jarrod Millman for providing me with a complete list.) Most projects are easily found by Googling for them, so I have not included hyperlinks except in some cases where the slides emphasized them. (See also the blog comments.)
  • Fernando gave an overview of the core Python software used throughout scientific computing: NumPy, Matplotlib, IPython (by Fernando), Mayavi, Sympy (about which more later), Cython, and lots more.
  • On behalf of Andrew Straw (Caltech), Fernando showed a video of an experimental setup where a firefly is tracked in real time by 8 cameras spewing 100 images per second, using Python software.
  • Nitime, a time-series analysis tool for neuroimaging, by Ariel Rokem (UCB).
  • A comparative genomics tool by Brent Pedersen of the Freeling Lab / Plant Biology (UCB).
  • Copperhead: Data-Parallel Python, by Bryan Catanzaro (working with Armando Fox) and others.
  • Nipype: Neuroimaging analysis pipeline and interfaces in Python, by Chris Burns (http://nipy.sourceforge.net/nipype/).
  • SymPy -- a library for symbolic mathematics in Pure Python, by Ondrej Certik (runs on Google App Engine: http://live.sympy.org).
  • Enthought Python Distribution -- a Python distro with scientific batteries included (some proprietary, many open source), supporting Windows, Mac, Linux and Solaris. (Travis Oliphant and Eric Jones, of Enthought.)
  • PySKI, by Erin Carson (working with Armando Fox) and others -- a tool for auto-tuning computational kernels on sparse matrices.
  • Rapid classification of astronomical time-series data, by Josh Bloom, UCB Astronomy Dept. One of the many tools using Python is GroupThink, which lets random people on the web help classify galaxies (more fun than watching porn :-).
  • The Hubble Space Telescope team in Baltimore has used Python for 10 years. They showed a tool for removing noise generated by cosmic rays from photos of galaxies. The future James Webb Space Telescope will also be using Python. (Perry Greenfield and Michael Droettboom, of STSCI.)
  • A $1B commitment by the Indian government to improve education in India includes a project by Prabhu Ramachandran of the Department of Aerospace Engineering at IIT Bombay for Python in Science and Engineering Education in India (see http://fossee.in/).
  • Wim Lavrijsen (LBL) presented work on Python usage in High Energy Physics.
  • William Stein (University of Washington) presented SAGE, a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
All in all, the impression I got was of an incredible wealth of software, written and maintained by dedicated volunteers all over the scientific community.

During the Q&A session, we touched upon the usual topics, like the Python 3 transition, the GIL (there was considerable interest in Antoine Pitrou's newgil work, which unfortunately I could not summarize adequately because I haven't studied it enough yet), Unladen Swallow, and the situation with distutils, setuptools and the future 'distribute' package (for which I unfortunately had to defer to the distutils-sig).

The folks maintaining NumPy have thought about Python 3 a lot, but haven't started planning the work. Like many other projects faced with the Python 3 porting task, they don't have enough people who actually know the code base well enough to embark upon such a project. They do have a plan for arriving at PEP 3118 compliance within the next 6 months.
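
For readers who haven't followed PEP 3118: it defines the revised buffer protocol, which lets objects share typed, shaped memory with each other without copying. Here is a minimal sketch of the idea using only the standard library's array and memoryview types (this is not NumPy code, and the variable names are made up for illustration):

```python
from array import array

# An array of C doubles; it exposes its memory through the buffer protocol.
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview wraps the same memory without copying it.
view = memoryview(samples)

# The view carries the element type and layout information that PEP 3118 describes.
info = (view.format, view.itemsize, view.ndim, view.shape)

# Writing through the view changes the original array, so no copy was made.
view[0] = 10.0
```

The point of compliance is that NumPy arrays and other libraries' objects can exchange data this way, describing element type and shape in a common format.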

Since NumPy is at the root of the dependency graph for many of the software packages presented here, getting NumPy ported to Python 3 is pretty important. We briefly discussed a possible way to obtain NumPy support for Python 3 sooner and with less effort: a smaller "core" of NumPy could be ported first. This would give the NumPy maintainers a manageable task, and selecting that "core" would give them the opportunity for a clean-up at the same time. (I presume this would mostly be a selection of subpackages to be ported, not an API-by-API cleanup; the latter would be a bad thing to do simultaneously with a big port.)

After the meeting, Fernando showed me a little about how NumPy is maintained. They have elaborate docstrings that are marked up with a (very light) variant of Sphinx, and they let the user community edit the docstrings through a structured wiki-like setup. Such changes are then presented to the developers for review, and can be incorporated into the code base with minimal effort.

An important aspect of this approach is that the users who edit the docstrings are often scientists who understand the computation being carried out in its scientific context, and who share their knowledge about the code and its background and limitations with other scientists who might be using the same code. This process, together with the facilities in IPython for quickly calling up the docstring for any object, really improves the value of the docstrings for the community. Maybe we could use something like this for the Python standard library; it might be a way that would allow non-programmers to help contribute to the Python project (one of the ideas also mentioned in the diversity discussions).
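
To give a flavor of what such a docstring looks like, here is a small sketch in the sectioned style NumPy uses. The function rescale is hypothetical, invented for this example, and the sketch uses the standard library's inspect module rather than IPython itself:

```python
import inspect


def rescale(values, factor=1.0):
    """Multiply every value by a constant factor.

    Parameters
    ----------
    values : list of float
        The numbers to rescale.
    factor : float, optional
        The multiplier applied to each value (default 1.0).

    Returns
    -------
    list of float
        A new list; the input is not modified.
    """
    return [v * factor for v in values]


# inspect.getdoc returns the docstring with its indentation cleaned up.
# help(rescale) prints this text, and typing "rescale?" in IPython shows
# essentially the same thing.
doc = inspect.getdoc(rescale)
first_line = doc.splitlines()[0]
```

Because the markup is this light, a scientist can improve the Parameters or Returns text without touching the code below it.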

35 comments:

Henrique said...

There is also InVesalius, a free medical software that reconstructs 3d images from the human body using MRI images and which is written in Python + VTK + ITK.

http://www.cti.gov.br/promed/software-en.htm

dcurtis said...

For our research in Computational Epidemiology at Iowa we are using a lot of python. One of our students created a multitouch display and simulation of the spread of disease in a hospital: demo

Noufal said...

There's an initiative funded by the Ministry of Human Resource Development in India called FOSSEE (fossee.in).

It's an attempt to showcase a variety of Open Source tools to engineers who are currently locked into expensive proprietary solutions. A large part of their offering is in Python and one of their main people is Dr. Prabhu Ramachandran who's the author of the scientific visualisation tool MayaVI (also in Python).

They are involved in conducting a SciPy conference in India this December.

Sokolov Roman said...

I'm interested in Copperhead. Are there any resources/links (except the ppt file with the presentation) about this project? Maybe source code?

Joseph Turian said...

At our lab (the LISA LABO at the Universite de Montreal), we developed a Python tool in-house that we use for machine learning: Theano. It's like numpy but can optimize mathematical expressions and compile them to C code.

Guido van Rossum said...

Jarrod Millman blogged about my visit:

http://jarrodmillman.blogspot.com/2009/11/visit-from-guido-van-rossum.html

antalar said...

Another interesting tool is the PyReport: http://gael-varoquaux.info/computers/pyreport/

Briefly, it allows you to use literate programming in Python. It is the only tool of this kind and it is very useful, especially in data analysis. Unfortunately, it hasn't been supported for a long time. :(

Jarrod Millman said...

I sent a link to your blog post to the UCB py4science list. Thanks again for your visit. Fernando Perez is in meetings all day, but I am sure he will post links to the video and slides on his blog later today.

Matthew said...

Thanks for the visit. It made me think differently about Py3k: http://nipyworld.blogspot.com/2009/11/guido-van-rossum-talks-about-python-3.html

Alejandro said...

What's the link for PySKI? Google results point to this same page.

leon_sit said...

Copperhead seems to be a very interesting project. After looking at a few presentations that I found on the web, I think it works smoothly with the map function, lambdas, etc., but not so much with the rest of the Python language. Cython already has good coverage of what Copperhead is lacking right now. It would be very EXCITING to see Copperhead integrated with Cython.

Alex Clemesha said...

Related to http://live.sympy.org are the larger "code notebook" projects, like http://sagenb.org and http://codenode.org.

Both projects are like "Google docs for programming" and I think represent a missing piece in the Python scientific community.

Guido van Rossum said...

At the meeting I misrepresented Antoine Pitrou's newgil work. His code is here: http://svn.python.org/view/sandbox/trunk/newgil/ and his original description of the work is here: http://mail.python.org/pipermail/python-dev/2009-October/093321.html . The work does not remove the GIL or make Python multi-threaded or add more fine-grained locks. However it vastly improves the efficiency of the existing GIL code, removing the so-called "Dave Beazley effect" (see the post for details).

I don't know any links about PySKI, but it is an offshoot of something called OSKI (Optimized Sparse Kernel Interface): http://bebop.cs.berkeley.edu/oski/

cool-RR said...

I'm also working on a Python project in scientific computing, GarlicSim.

It's a framework for working with simulations. If you do any sort of work with simulations, I urge you to check out the readme (and the Introduction.)

Max said...

There will be a tutorial on "Python for High Performance and Scientific Computing" at Supercomputing 2009 and there are 112 registered participants already. We use Python for most of the analysis of Lattice QCD data.

koolhead17 said...

Rpy seems interesting too: http://rpy.sourceforge.net/
A Python extension over the R statistical programming language

Edvin said...

I work in structural bioinformatics and may add to the list:
The molecular modeling toolkit (MMTK): http://dirac.cnrs-orleans.fr/MMTK/
BioPython: http://biopython.org/wiki/Main_Page

I'd also like to mention Rpy, which I sometimes use for leveraging R's plotting packages from within Python.

Personally I develop methods in structural bioinformatics and it is really great to have a language that minimizes coding time when new ideas are to be tested out. I also like that I can develop code on my Macbook, yet my coworkers can check out and run it without any hassle on their Linux boxes, and I can very easily move things over to our Linux cluster whenever I need to run larger-scale analysis.

Fernando said...

Thanks again for the visit, Guido, it was fantastic and the feedback I've received locally has been very good.

I've posted on my blog my take on the event:

http://fdoperez.blogspot.com/2009/11/guido-van-rossum-at-uc-berkeleys.html

and I also have now on my site all the slides from the presentations:

http://fperez.org/py4science/2009_guido_ucb/index.html

Finally, I should note that the whole session is available on video:

http://www.archive.org/details/ucb_py4science_2009_11_04_Guido_van_Rossum

Anne Carpenter said...

Our group is currently porting our open-source software projects to Python, making heavy use of SciPy and NumPy. We have been pleased with the process and the results. Our software is used in biomedical research for a wide variety of applications, most notably identifying the genetic basis and chemical treatments for diseases, based on screening microscopy images (www.cellprofiler.org).
* CellProfiler software for high-throughput image analysis (formerly Matlab-based)
* CellProfiler Analyst software for data exploration (formerly Java-based)

Paul F. Dubois said...

Python over C++ on massively parallel supercomputers -- who would have believed it 15 years ago? I am so proud of this community. And thank you, Guido, for listening to what we needed.

dalloliogm said...

Python is suffering a lot in the scientific world, because it does not have a CPAN-like repository.

PyPI is fine, but it is still far from the level of CPAN, CRAN, Bioconductor, etc..

Scientists who use programming usually have a lot of different interests and approaches, therefore it is really difficult to write a package that can be useful to everyone.
Other programming languages like Perl and R have repository-like structures which enable people to download packages easily, and to upload new ones and organize them without having to worry about integrating them into existing packages.

This is what is happening to biopython now: it is a monolithic package that is supposed to work for any bioinformatics problem; but this is so general that to accomplish that you would need to add a lot of dependencies, to numpy, networkx, suds, any kind of library.
However, since easy_install is not yet as ready as its counterparts in other languages, if the biopython developers add too many dependencies, nobody will be able to install it properly, and nobody will use it.

Recession Cone said...

I'm Bryan Catanzaro, one of the authors of the Copperhead project. It was great to be able to talk briefly about our work during Guido's visit, and thanks for the interest expressed in the comments. We are planning on releasing the source code for Copperhead in the next few months, when it's not quite so raw. Feel free to email me directly with any questions.

robince said...

Python's also gaining increasing momentum in neuroscience, where it's competing mainly with MATLAB. See for example the recent special issue of Frontiers in Neuroinformatics, as well as workshops at major conferences and summer schools in the field.

Script Maven said...

Did you see any reports about using PyMPI or other MPI versions of Python to control large parallel jobs?

Travis said...

Thanks for continuing to engage the Scientific Community. I haven't had nearly the time I would have liked to continue to help push the Python/NumPy linkage. Fortunately, there are other great people continuing the work.

olenz said...

I would like to add another scientific tool under development that will heavily use Python: The Molecular Dynamics simulation package ESPResSo++. The predecessor ESPResSo used Tcl/Tk as a scripting frontend language. Thanks to Guido for the great advantages that Python has over Tcl/Tk!

eli said...

It is very exciting to see a free open-source python becoming the new standard language used and taught in science.
In my opinion the lack of two things prevents python from already being the de-facto standard: 1) documentation/standard operating procedures and 2) a uniform interface for the download/installation of modules.

1) Environments such as MATLAB maintain their central role mostly because they are easy to start working with. Little customization is necessary and, most importantly, the documentation is in a single place, it's extensive and contains lots of *basic* examples for *novices* (basically SOPs for the most common tasks).
IPython takes a wonderful step in the right direction with the ? magic operator, but a centralized and well-organized repository containing documentation of all scientific py modules would save the beginning user much googling.

2) I second the comment about a CPAN-like repository. The lack of a uniform interface to install scientific modules is a big barrier for the average scientific user, who wants things that "just work" and can't be bothered with working out installation procedures and dependencies.

Thanks for all the great work..

HilbertAstronaut said...

@Alejandro I know the PySKI authors -- I'll ask them if they are ready to post links here.

harijay said...

The structural biology world has embraced python since its inception. Pymol( http://www.pymol.org), the phenix project (http://www.phenix-online.org/) , coot, the list is quite big. It would be almost impossible to practice crystallography today without using at least one tool or the other which incorporates python.

Herberth Amaral said...

Hi,

I wrote a post (in Portuguese) discussing Python in the Scientific World.

The culture of Python is very weak where I live, but some people (including me) are trying to make this reality a bit different.

You can see the post here: http://herberthamaral.com/2009/11/python-e-computacao-cientifica/

See ya!

user120 said...

Hi everyone,

I am developing a major project in python 2.5 version now and it will complete in a month.

I don't want any code changes till 2012.

Is it ok to just stick to 2.5 v rather than 3.0v? (Somehow I amn't convinced of using 2.6v.)

please reply.

Many Thanks.

Tenzig Norgay said...

Just found Guido's post now. The biggest thing you could do to advance the use of Python in science is to get Numpy installed on Google's AppEngine platform.

AppEngine is a wonderful thing, a creation of genius, really, but its lack of efficient numerical computation means that almost everyone in science ignores it. Put Numpy on AppEngine, and you will see a flowering of usage of both Numpy and AppEngine.

Guido van Rossum said...

@Tenzig Norgay: please voice your support on the App Engine tracker!

http://code.google.com/p/googleappengine/issues/detail?id=190