Friday, December 11, 2009

While-you-type Searching

Here's an idea that is just begging to be implemented as a Firefox extension.

You know how there's a while-you-type spell checker that's always on when you are editing text in a multi-line text box? There should be a feature that takes the last few words you're typing (or the entire current paragraph, or whatever works best), does a Google search, and presents snippets for the top few results in an unobtrusive pop-up window.
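In rough sketch form, the core loop might look something like the following, where get_text(), search(), and show_snippets() are hypothetical stand-ins for the browser and search-engine plumbing:

import time

def watch_textbox(get_text, search, show_snippets, idle=1.0):
    # Poll the text box; when typing pauses and the text has changed,
    # search on the last few words and show the top few results.
    last = None
    while True:
        time.sleep(idle)  # crude debounce: wait for a pause in typing
        text = get_text()
        if text == last:
            continue
        last = text
        query = ' '.join(text.split()[-8:])  # the last few words typed
        show_snippets(search(query)[:3])     # top few results only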

Sure, maybe you're thinking "It looks like you're writing a letter. Do you want me to write it for you?" (the Microsoft paperclip). But using web search instead of a fixed set of patterns could actually make this useful. Imagine the number of messages to customer support forums that will never have to be sent because this feature pops up the answer the user was looking for. And so on.

You might also think, "this already exists, it's called Google auto-suggest." But I specifically want it to work when I'm not (yet) actively searching, but just writing. (If it already existed, it might have stopped me from writing this blog post. :-) Twitter might also become a different place if users realized how many others have already entered the same item.

Of course there's a little privacy issue. But still, if this existed, I'd opt in! (In fact, I did half a dozen searches while I was typing this. How much easier it would be if I didn't have to select text, switch to a different tab, paste, and hit enter, losing my writing context each time.)

Thursday, November 5, 2009

Python in the Scientific World

Yesterday I attended a biweekly meeting of an informal UC Berkeley group devoted to Python in science (Py4Science), organized by Fernando Perez. The format (in honor of my visit) was a series of 4-minute lightning talks about various projects using Python in the scientific world (at Berkeley and elsewhere), followed by an hour-long Q&A session. This meant I didn't have to do a presentation and still got to interact with the audience for an hour -- my ideal format.

I was blown away by the wide variety of Python use for scientific work. It looks like Python (with extensions like numpy) is becoming a standard tool for many sciences that need to process large amounts of data, from neuroimaging to astronomy.

Here is a list of the topics presented (though not in the order presented). All of these describe Python software; I've added names and affiliations insofar as I managed to get them. (Thanks to Jarrod Millman for providing me with a complete list.) Most projects are easily found by Googling for them, so I have not included hyperlinks except in some cases where the slides emphasized them. (See also the blog comments.)
  • Fernando gave an overview of the core Python software used throughout scientific computing: NumPy, Matplotlib, IPython (by Fernando), Mayavi, Sympy (about which more later), Cython, and lots more.
  • On behalf of Andrew Straw (Caltech), Fernando showed a video of an experimental setup where a firefly is tracked in real time by 8 cameras spewing 100 images per second, using Python software.
  • Nitime, a time-series analysis tool for neuroimaging, by Ariel Rokem (UCB).
  • A comparative genomics tool by Brent Pedersen of the Freeling Lab / Plant Biology (UCB).
  • Copperhead: Data-Parallel Python, by Bryan Catanzaro (working with Armando Fox) and others.
  • Nipype: Neuroimaging analysis pipeline and interfaces in Python, by Chris Burns (http://nipy.sourceforge.net/nipype/).
  • SymPy -- a library for symbolic mathematics in Pure Python, by Ondrej Certik (runs on Google App Engine: http://live.sympy.org).
  • Enthought Python Distribution -- a Python distro with scientific batteries included (some proprietary, many open source), supporting Windows, Mac, Linux and Solaris. (Travis Oliphant and Eric Jones, of Enthought.)
  • PySKI, by Erin Carson (working with Armando Fox) and others -- a tool for auto-tuning computational kernels on sparse matrices.
  • Rapid classification of astronomical time-series data, by Josh Bloom, UCB Astronomy Dept. One of the many tools using Python is GroupThink, which lets random people on the web help classify galaxies (more fun than watching porn :-).
  • The Hubble Space Telescope team in Baltimore has used Python for 10 years. They showed a tool for removing noise generated by cosmic rays from photos of galaxies. The future James Webb Space Telescope will also be using Python. (Perry Greenfield and Michael Droettboom, of STSCI.)
  • A $1B commitment by the Indian government to improve education in India includes a project by Prabhu Ramachandran of the Department of Aerospace Engineering at IIT Bombay for Python in Science and Engineering Education in India (see http://fossee.in/).
  • Wim Lavrijsen (LBL) presented work on Python usage in High Energy Physics.
  • William Stein (University of Washington) presented SAGE, a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
All in all, the impression I got was of an incredible wealth of software, written and maintained by dedicated volunteers all over the scientific community.

During the Q&A session, we touched upon the usual topics, like the Python 3 transition, the GIL (there was considerable interest in Antoine Pitrou's newgil work, which unfortunately I could not summarize adequately because I haven't studied it enough yet), Unladen Swallow, and the situation with distutils, setuptools and the future 'distribute' package (for which I unfortunately had to defer to the distutils-sig).

The folks maintaining NumPy have thought about Python 3 a lot, but haven't started planning the work. Like many other projects faced with the Python 3 porting task, they don't have enough people who actually know the code base well enough to embark upon such a project. They do have a plan for arriving at PEP 3118 compliance within the next 6 months.

Since NumPy is at the root of the dependency graph for many of the software packages presented here, getting NumPy ported to Python 3 is pretty important. We briefly discussed a possible way to obtain NumPy support for Python 3 sooner and with less effort: a smaller "core" of NumPy could be ported first. This would give the NumPy maintainers a manageable task, and selecting a smaller "core" would give them the opportunity for a clean-up at the same time. (I presume this would mostly be a selection of subpackages to be ported, not an API-by-API cleanup; the latter would be a bad thing to do simultaneously with a big port.)

After the meeting, Fernando showed me a little about how NumPy is maintained. They have elaborate docstrings that are marked up with a (very light) variant of Sphinx, and they let the user community edit the docstrings through a structured wiki-like setup. Such changes are then presented to the developers for review, and can be incorporated into the code base with minimal effort.
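For flavor, here is roughly what such a docstring looks like (a sketch in the NumPy documentation style; the real conventions cover more sections, such as Examples and References):

def clip(a, a_min, a_max):
    """Limit the values in an array to a given interval.

    Parameters
    ----------
    a : array_like
        Input array.
    a_min, a_max : scalar
        Lower and upper bounds; values outside the interval
        are replaced by the nearest bound.

    Returns
    -------
    clipped : ndarray
        An array with the same shape as a.
    """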

An important aspect of this approach is that the users who edit the docstrings are often scientists who understand the computation being carried out in its scientific context, and who share their knowledge about the code and its background and limitations with other scientists who might be using the same code. This process, together with the facilities in IPython for quickly calling up the docstring for any object, really improves the value of the docstrings for the community. Maybe we could use something like this for the Python standard library; it might be a way that would allow non-programmers to help contribute to the Python project (one of the ideas also mentioned in the diversity discussions).

Thursday, September 17, 2009

Lovely Python!

I just heard from Bill Xu in China. His book "Lovely Python", an introduction to Python in Chinese, was just published and shot into the top 5 of china-pub.com's bestseller list (at one point it was even #2). I can't read Chinese, but I am very glad that there's a book on Python available for Chinese readers, which is why I wrote a brief foreword for the book as well. (Also because I am one of the mentors of Zeuux.org.) Links:

This is the Lovely Python page on the book store:

This is the bestseller list:

This is the cover (with front and back) of Lovely Python:

This is my preface:

Wednesday, July 22, 2009

Scientists Discover That Hidden Persuaders Are Real

In yesterday's post I mentioned reading George Lakoff's book, The Political Mind. While I agree with the politics of the book in almost every instance, I was still disappointed. For one thing, the book "compresses well." (IOW it contains a lot of repetition. A Lot.) It also felt a bit like a classic bait-and-switch: the back flap touts "the science behind how our brains understand politics" but the contents are 90% political rhetoric, and I'm still in doubt about the science.

The author's premise is attractive enough as far as it goes: our brains don't make perfectly rational decisions, but are influenced by "framing"; and the Republicans have used this to their advantage while the Democrats with their belief in "pure reason" have not properly defended themselves by accepting the conservative framing (for example: "tax relief").

Well, there may be some recent scientific research confirming that most people are not so good at rational decision making, but honestly, I thought the importance of framing had long been well known to all politicians -- and advertisers as well. As for the recent scientific proof of this commonly-known fact, Jonah Lehrer's book "How We Decide" contains at least as much about the research, and the non-scientific parts of his book are better written and, I expect, more future-proof.

I'm also skeptical of the importance of Lakoff's discovery that frames are represented physically in the brain. That's about as insightful as saying that this blog entry exists physically in Google's computers (as magnetic fluctuations on a hard drive). Has he never heard of abstractions? He seems to argue that all of philosophy needs to be thrown away because it ignores this fact. I will gladly accept that we cannot treat the brain as a perfect mathematical machine, and using the embodiment of the mind will probably eventually help us understand consciousness (more likely than abstract reasoning like Douglas Hofstadter's approach, no matter how much I enjoy his puzzles and paradoxes).

But the important message to me is still about how the brain's software works. It's useful to know that frames are reinforced by trauma and repetition, and that it requires a lot of repetition of counteracting frames to override them once they're there. And yes, that the Bush government used this to its advantage is a great example. But I wanted to know more about the science, and less about the politics.

Lakoff's other point is that human beings are born to have empathy with each other. But he doesn't mention much of the science behind this. That's because in the end he is a linguist, and linguists spend most of their time studying (and arguing about) human language, which is itself the result of a long evolutionary path and cannot necessarily explain it. And his oft-repeated use of the words America and American in connection with empathy is surely his own little joke: he's trying to make the reader believe that American values are nurturing values by applying his own theory -- say it over and over and the frame will be hard-wired (whether that's literally or figuratively :-) in the reader's mind. As a non-US-citizen, I wish the emphasis were on human values, not American values.

Tuesday, July 21, 2009

Progressive vs. Conservative

[Warning: loose thoughts ahead!]

Microsoft's Erik Meijer gave a talk at Google yesterday, and afterwards I had lunch with him. One of his remarks was (I paraphrase) that Microsoft users want to be told what to do, while the Java community is more vocal or argumentative. (He didn't discuss the Python community, but in my experience it falls in the latter category.)

Now, while lying sick in bed with a hacking cough, I am reading George Lakoff's "The Political Mind". This book tries to model the distinction between conservative and progressive politics on two different ideal family models: the strict father (from which most conservative moral virtues flow, according to Lakoff), and the nurturing family, from which the progressive moral virtues derive.

The parallel with Microsoft users vs. Java users seems to be all too obvious: Microsoft as the strict father: If you are loyal you will be rewarded, but if you stray you will be punished; whereas in the Java (or Python) community benefits and moral goodness flow from helping each other (which includes sharing open source software, and, apparently, bikeshedding :-).

What about other companies and communities? I can't help thinking of Oracle as the ultimate strict-father company, which makes me worry about the Sun takeover. Are Linus Torvalds and Richard Stallman strict fathers?

Friday, June 26, 2009

IronPython in Action and the Decline of Windows

While CPython still rules on python-dev, especially with the excitement around Py3k, Python's alternative implementations are growing up: PyPy is now capable of running Django, Jython just released version 2.5, and IronPython has been releasing significant milestones like clockwork. I get a lot of satisfaction out of such milestones: they help establish Python as a language you can't ignore, no matter which platform you are using.

Seeing a book like IronPython in Action, by Michael Foord and Christian Muirhead, is another milestone for IronPython. This is a solid work in every aspect, and something nobody using IronPython on .NET should be without. The book is chock full of useful information, presented along with a series of running examples, and covers almost every aspect of IronPython use imaginable.

After reading the table of contents and the introduction by IronPython's creator Jim Hugunin, I couldn't help myself and skipped straight to appendix A, "A whirlwind tour of C#." This is a useful thing to have around for readers like myself who haven't really kept track of things in the .NET world. Maybe I'll comment more on C# another time. For now, let me just say that it seems a decent enough system programming language. The more relevant thing about C# is that you can't avoid learning it if you are developing on .NET, even when using IronPython. There are just too many issues where IronPython has to work around a limitation of C#. This often happens indirectly, where a particular API was designed purely with C# in mind. And then there's the issue that Microsoft's API documentation focuses on C#. (And VB.NET, I suppose, which, after seeing some samples in this book, I have less desire to know than ever.)

There are some introductory chapters -- some fluff about .NET and the CLR, an introduction to Python, and an introduction to working with .NET objects from IronPython. The Python introduction has a slight emphasis on the differences between IronPython and CPython, though there aren't enough to fill a chapter. This is a good thing! The chapter does a pretty good job of teaching Python, assuming you already know programming. In general, the book is aimed solidly at professional software developers: unless you are paid to do it, why would anyone want to get intimate with Windows?

Yes, Windows programming is what this book is really about. I'm sure that doing Windows programming using IronPython is a much better proposition than Windows programming using C++; but it's still Windows programming. Fortunately the authors maintain a slightly ironic attitude about Windows. I can't help admiring their persistence in getting to the bottom of the many mysteries presented by Windows (and in some cases by IronPython's wrappers).

Many, many years ago -- so long ago that I can't even recall when -- I did some Windows programming myself, using Mark Hammond's Win32 extensions for CPython. That package maps the Win32 API pretty directly to Python. It lets you work with Windows in much the same way as you can in IronPython -- the main difference is that IronPython lives in the more modern .NET world, while Win32 is showing its age.

But is life with IronPython all that much better than with CPython+Win32? It still looks incredibly tedious to create the simplest of UIs. Each button in the UI has to be tediously positioned and configured (width, height, padding, font size, etc.). The book maintains a running example throughout many of the chapters, and one of the earlier versions (with many features not yet developed) clocks in at a "mere" 258 lines. Fortunately, the source code for all examples can be downloaded from the book's website. While the downloaded zip file is a whopping 33 megabytes, there's actually only half a megabyte of source code in it (and much of it multiple versions of the same running example) -- the majority of the download is not source code but DLLs, probably included by the authors because Microsoft scatters them around half a dozen or more different support websites. (Plus, there seem to be multiple copies of IronPython.dll and a few other DLLs included.)
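For illustration, here is a minimal sketch of what this looks like in IronPython with Windows Forms (the standard System.Windows.Forms API; a real form multiplies these property assignments by every control on it):

import clr
clr.AddReference("System.Windows.Forms")
clr.AddReference("System.Drawing")
from System.Windows.Forms import Application, Button, Form
from System.Drawing import Point, Size

form = Form()
form.Text = "Hello"
form.ClientSize = Size(200, 80)

button = Button()
button.Text = "Click me"
button.Location = Point(50, 25)  # every control is positioned by hand...
button.Size = Size(100, 30)      # ...and sized by hand, too
form.Controls.Add(button)

Application.Run(form)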

This then, was the big eye-opener for me: that despite all the hype, Windows UI programming is as tedious today as it was in 1995. Sure, the new UI looks a lot better. But that's mostly glitz: 3D effects, color gradients, video, and so on. Sure, it's all object-oriented now. But it hasn't really gotten any less complex to create the simplest of simple UIs. And that's a shame. When is Microsoft going to learn the real lesson about the simplicity of HTML? Instead, Microsoft is doing the same thing to HTML that it does to anything it touches: adding cruft to the point where the basic functionality is buried so deeply that most people can't even find it. You can't really blame the average Windows developer for focusing on eye candy instead of usability -- it's all mashed together in the APIs, and by the time you've got something that works at all, you're too exhausted to look at it from your users' perspective.

It's no wonder that users are switching to the web as the platform for everything that used to live on the desktop -- with all its flaws (which I will discuss another time), web development still feels like a breeze compared to Windows development. And that means a shorter time to release, and hence more frequent releases, which in turn means more opportunities for developers to learn what their users actually do. Which as a user I really appreciate.

Tuesday, June 16, 2009

Highs and Lows of IEEE Computer Magazine

I still read a few print publications, including IEEE Computer. Today's issue contained a high and a low:

Today's high point was a detailed history of the Conficker worm. Since we're a Macintosh family, and Google typically has its security stuff in order, I was barely aware of it. The sophistication of the worm's creators is almost admirable. (They probably use Python too. :-) An interesting table in the article included information about which countries contribute the most to the worm's population. China, Brazil and Russia top the list. You could have all sorts of theories on why this would be; personally I'm assuming it's a combination of sheer number of computers plus widespread use of bootlegged copies of Windows.

The low point was an article on "Software Engineering Ethics." Why a low point? Look at this table and think of how many bits of information it contains:

Using postphenomenology for software engineering ethics

    Actions                         Desirable   Undesirable
    Amplify experiences that are        +            -
    Reduce experiences that are         -            +
    Invite actions that are             +            -
    Inhibit actions that are            -            +

Ironically, this pointless table contains a redundant column, while the table I mentioned above was missing a column that would have been useful -- how many PCs are installed in each country. Oh well.

PS: Googling for "postphenomenology" gives this as the title of the first hit: "If phenomenology is an albatross, is postphenomenology possible?" The web knows best.

Monday, June 15, 2009

New App Engine Book

At Google I/O I received a copy of Using Google App Engine by Charles Severance, published by O'Reilly. I haven't kept track, but this appears to be one of the first App Engine books to actually hit the stores -- an Amazon search for App Engine turned up one other book (Developing with Google App Engine by Eugene Ciurana, published by Apress) and many titles available for pre-order (including additional titles from the same publishers).

Severance's book is a quick read if you're already familiar with the basic premises of web programming. I think it would do well in an introductory course about the topic. (The author teaches at the University of Michigan, so this is likely how he developed the material in the first place.) In fact, quite a bit of the book could well have come from an earlier course: the chapters on HTML, CSS, Python and JavaScript barely mention App Engine.

Don't get me wrong, I think that's a good approach: in my experience quite a few App Engine users are new to web programming in general, or could at least use a refresher course. If you don't fall in this category, don't feel offended: you just probably aren't the intended audience for this book. On the other hand, if you've developed for the web but haven't used Python before, you could probably just skip the HTML/CSS chapter and dive right into Python and App Engine.

If you're a blank sheet when it comes to programming, don't expect to come out an experienced Python developer: the book only covers enough of the language so you can get started with App Engine without feeling you're just copying and pasting text. The same is actually true for any topic covered -- in many cases the book actually recommends that you study a topic more in depth using other resources. But in each case the book's coverage is enough to get you started with the creation of dynamic web sites, and that's the important part. After all, you didn't learn your mother tongue by studying the rules of grammar either: you learned a few nouns, a few verbs, a few adjectives, and a few grammatical forms ("Daddy throw toy again") and you were on your way to communicating with others.

Actually, if you read this book from cover to cover, you might not be ready to create the Great American Website, but you'll be well past the "Daddy throw toy" level. For example, you'll be creating App Engine datastore models with ease, tying them together with forms, and you'll even be able to use simple AJAX patterns. You will also have learned about the importance of caching, and you'll have more than a fleeting experience debugging problems using tracebacks and logs.
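To give a flavor of the level it gets you to, here is a small sketch using App Engine's datastore API of that era (the model and names are my own illustration, not taken from the book):

from google.appengine.ext import db

class GuestbookEntry(db.Model):
    author = db.StringProperty()
    content = db.TextProperty()
    posted = db.DateTimeProperty(auto_now_add=True)

# Store one entry, then fetch the ten most recent ones.
GuestbookEntry(author='Ada', content='Hello!').put()
recent = GuestbookEntry.all().order('-posted').fetch(10)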

I also enjoyed some of the history bits that Severance presents (it makes me feel old to see 1990 referred to as ancient history :-). A downside is that the exercises given at the end of each chapter sometimes seem focused more on assessing that you were awake during class than that you actually learned a useful skill (e.g. "Give a brief history of the major phases of the internet"). Teachers considering using this book in the classroom might appreciate such questions; but for self-study, I would focus on the difference between the class= and id= attributes in HTML...

What's missing? The book doesn't touch Django (except for the templating facility built into App Engine's webapp package, which is based on Django). If our customer support traffic is any indication, Django is very popular with professional App Engine developers. The book also doesn't describe the various APIs offered by App Engine for things like sending mail, fetching other web resources by URL, or image processing. But arguably you can learn those directly from the App Engine docs. Oh, and the book doesn't touch on App Engine's Java support. I expect other books will fill that void.

Tuesday, May 26, 2009

So you want to learn Python?

There's never a lack of books to use for learning Python. I occasionally receive books for review, but I don't have a particularly good yardstick to judge such books by: I find that they all contain some factual errors and some oddities of presentation, but I have no idea whether those matter for the readers. Even Knuth's books are full of errors: for example the errata for Vol. 1 (2nd ed.) are a staggering 80 pages, but I doubt anybody besides Knuth himself is bothered by this knowledge.

Recently I got a review copy of "Hello World", and a colleague kindly lent me his copy of "Practical Programming". I think it's interesting to compare the two a bit, since they both claim to be teaching Python programming to people who haven't programmed before. And yet their audiences are totally different!

"Hello World", published by Manning, is written by Warren Sande and his son Carter. The subtitle is "Computer programming for kids and other beginners", but I think if you're not a kid any more you might get annoyed by the rather popular writing style. If you are a kid, well, you will probably enjoy a book written with you in mind, and you will learn plenty. The only prerequisites are reading and typing skills, a computer that wasn't built in the stone age, and a desire to learn more about what goes on inside that computer. The book uses short chapters with lots of illustrations, often cartoons and jokes. There are lots of opportunities to try out the material and learn that way. Each chapter ends with a review section, some tests, and more experiments to try. The book pays plenty of attention to typical "gotchas", so that if you get stuck at some point, there probably is help nearby to get you unstuck.

"Practical Programming" is written by Jennifer Campbell, Paul Gries, Jason Montojo, and Greg Wilson. This a team composed of three university professors and a former student of theirs. Their purported goal is to teach Computer Science (with Capital Letters), and Python is merely a teaching vehicle. But they spend about half of the book on Python itself, covering roughly the same material as any introduction to Python, including "Hello World". Their intended audience is clearly more mature than that of the Sandes, and I would think that Carter Sande and his friends would have a hard time staying focused on the material as presented by Campbell et al. -- their illustrations and diagrams are more functional but a lot less fun.

Both books present a number of projects and running examples. Again, the difference in audience makes it likely that if you love one, you'll hate the other, and vice versa. "Hello World" takes its examples from computer games. The games are extremely simple though: modern computer games are some of the most complex systems around, and you can't expect to approach them using PyGame and a couple hundred lines of Python. "Practical Programming" takes its examples from scientific data processing with an environmental touch: for example, a numerical series is presented as whale sightings over the years, and 2-dimensional data is drawn from deforestation statistics. No doubt this is done in an attempt to appeal to a certain kind of student, though the number of potential applications is so large that some students might just as well be turned off by the specific set of choices.

In the end, "Hello World" will leave the reader with a fair amount of practical Python experience, enough to get them started on the long road to becoming a programmer if they are so inclined, or at least enough to give them some idea of what it is that programmers do. "Practical Programming" tries to go further: it presents some well-known algorithms (there's even a discussion of MergeSort), and it has introductory chapters on topics like object-oriented programming and databases. The overall focus is still on being able to use all this new knowledge in one's professional life, and I hesitate to agree with the authors' apparent view that it teaches "Computer Science". Calling it "Computer Use" would cover the contents better, I think, and that's more in line with the series title as well ("The Pragmatic Programmers", also the publisher).

So, how do you learn about Computer Science? Some would no doubt recommend "Structure and Interpretation of Computer Programs" by Abelson and Sussman here. (Someone sent me a review copy of that book too.) But really, SICP (as it is often referred to) has its own agenda: convincing the reader that the most important thing computers can do is interpreting computer programs. This agenda has arguably caused the proliferation of Scheme implementations and indoctrinated many young minds with certain ideas about how to design and implement programming languages. But personally, I recommend you go straight to the source. After all these years, there is still no substitute for Knuth.

[UPDATE: fixed book titles as commenters pointed out my typos.]

Monday, April 27, 2009

Final Words on Tail Calls

A lot of people remarked that in my post on Tail Recursion Elimination I confused tail self-recursion with other tail calls, which proper Tail Call Optimization (TCO) also eliminates. I now feel more educated: tail calls are not just about loops. I started my blog post when someone pointed out several recent posts by Pythonistas playing around with implementing tail self-recursion through decorators or bytecode hacks. In the eyes of the TCO proponents those were all amateurs, and perhaps that's so.

The one issue on which TCO advocates seem to agree with me is that TCO is a feature, not an optimization. (Even though in some compiled languages it really is provided by a compiler optimization.) We can argue over whether it is a desirable feature. Personally, I think it is a fine feature for some languages, but I don't think it fits Python: The elimination of stack traces for some calls but not others would certainly confuse many users, who have not been raised with tail call religion but might have learned about call semantics by tracing through a few calls in a debugger.

The main issue here is that I expect that in many cases tail calls are not of a recursive nature (neither direct nor indirect), so the elimination of stack frames doesn't do anything for the algorithmic complexity of the code, but it does make debugging harder. For example, if you have a function ending in something like this:
if x > y:
    return some_call(z)
else:
    return 42
and you end up in the debugger inside some_call() even though you expected to have taken the other branch, then with TCO as a feature your debugger can't tell you the values of x and y, because the stack frame has been eliminated.

(I'm sure at this point someone will bring up that the debugger should be smarter. Sure. I'm expecting your patch for CPython any minute now.)

The most interesting use case brought up for TCO is the implementation of algorithms involving state machines. The proponents of TCO claim that the only alternative to TCO is a loop with lots of state, which they consider ugly. Now, apart from the observation that since TCO essentially is a GOTO, you can write spaghetti code using TCO just as easily, Ian Bicking gave a solution that is as simple as it is elegant. (I saw it in a comment to someone's blog that I can't find right now; I'll add a link if someone adds it in a comment here.) Instead of this tail call:
return foo(args)
you write this:
return foo, (args,)
which doesn't call foo() but just returns it and an argument tuple, and embed everything in a "driver" loop like this:
func, args = ...initial func/args pair...
while True:
    func, args = func(*args)
If you need an exit condition you can use an exception, or you could invent some other protocol to signal the end of the loop (like returning None).
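Putting the pieces together, here is a small self-contained sketch (my example, not Ian's) of a two-state machine driven by such a loop, using None to signal the end:

def even_state(n):
    if n == 0:
        return None  # signal the end of the loop
    return odd_state, (n - 1,)

def odd_state(n):
    if n == 0:
        return None
    return even_state, (n - 1,)

def run(func, args):
    # The "driver" loop: the stack never grows, no TCO required.
    while True:
        result = func(*args)
        if result is None:
            break
        func, args = result

run(even_state, (10,))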

And here it ends. One other thing I learned is that some in the academic world scornfully refer to Python as "the Basic of the future". Personally, I'd rather see that as a badge of honor, and it gives me an opportunity to plug a book of interviews with language designers to which I contributed, side by side with the creators of Basic, C++, Perl, Java, and other academically scorned languages -- as well as those of ML and Haskell, I hasten to add. (Apparently the creators of Scheme were too busy arguing whether to say "tail call optimization" or "proper tail recursion." :-)

Saturday, April 25, 2009

People Who Annoy Me

It's Grumpy Saturday. If I didn't respond to your email or "friend" request, maybe you fall in one of the following categories.

1. People who have exchanged some email with me (or who once commented on my blog, or were once in the same line at a conference) and send me a LinkedIn friend request with the default invite message.

2. People I've never heard from asking to become my friend on Facebook.

3. People who send me a long form letter inviting me to speak at their conference, without any kind of personal note at the top.

4. People who resend that form letter three times.

5. People who get all upset when I respond to the last identical copy of that form letter with "Please stop spamming me."

6. People who send me a long rambling email asking me a Python programming question.

7. People who send me a long rambling email proposing a change to the Python language.

8. People who collect pictures of famous people's desktops.

9. People who send me email asking if they can put their ads on my website.

10. People who create fake blogs with computer-generated pseudo-nonsense text that happens to include my name.

Wednesday, April 22, 2009

Tail Recursion Elimination

I recently posted an entry in my Python History blog on the origins of Python's functional features. A side remark about not supporting tail recursion elimination (TRE) immediately sparked several comments about what a pity it is that Python doesn't do this, including links to recent blog entries by others trying to "prove" that TRE can be added to Python easily. So let me defend my position (which is that I don't want TRE in the language). If you want a short answer, it's simply unpythonic. Here's the long answer:

First, as one commenter remarked, TRE is incompatible with nice stack traces: when a tail recursion is eliminated, there's no stack frame left to use to print a traceback when something goes wrong later. This will confuse users who inadvertently wrote something recursive (the recursion isn't obvious in the stack trace printed), and makes debugging hard. Providing an option to disable TRE seems wrong to me: Python's default is and should always be to be maximally helpful for debugging. This also brings me to the next issue:

Second, the idea that TRE is merely an optimization, which each Python implementation can choose to implement or not, is wrong. Once tail recursion elimination exists, developers will start writing code that depends on it, and their code won't run on implementations that don't provide it: a typical Python implementation allows 1000 recursions, which is plenty for non-recursively written code and for code that recurses to traverse, for example, a typical parse tree, but not enough for a recursively written loop over a large list.
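To make that concrete, here is a tiny sketch of the failure mode (in CPython the limit defaults to 1000 and is adjustable via sys.setrecursionlimit()):

import sys

def countdown(n):
    if n == 0:
        return 'done'
    return countdown(n - 1)  # a tail call, but CPython still grows the stack

sys.getrecursionlimit()  # 1000 by default
countdown(100)           # fine
countdown(5000)          # RuntimeError: maximum recursion depth exceeded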

Third, I don't believe in recursion as the basis of all programming. This is a fundamental belief of certain computer scientists, especially those who love Scheme and like to teach programming by starting with a "cons" cell and recursion. But to me, seeing recursion as the basis of everything else is just a nice theoretical approach to fundamental mathematics (turtles all the way down), not a day-to-day tool.

For practical purposes, Python-style lists (which are flexible arrays, not linked lists), and sequences in general, are much more useful to start exploring the wonderful world of programming than recursion. They are some of the most important tools for experienced Python programmers, too. Using a linked list to represent a sequence of values is distinctly unpythonic, and in most cases very inefficient. Most of Python's library is written with sequences and iterators as fundamental building blocks (and dictionaries, of course), not linked lists, so you'd be locking yourself out of a lot of pre-defined functionality by not using lists or sequences.

Last, let's look at how we could implement tail recursion elimination. The first observation is that you can't do it at compile time. I've seen at least one blog entry that used a bytecode hack to replace a CALL opcode immediately before a RETURN opcode with a jump to the top of the function body. This may be a nice demo, but unfortunately Python's compiler cannot reliably determine whether any particular call actually references the current function, even if it appears to have the same name. Consider this simple example:
def f(x):
    if x > 0:
        return f(x-1)
    return 0

It looks like you could replace the body with something like this:
if x > 0:
    x = x-1
    <jump to top>
return 0

This seems simple enough, but now add this:
g = f
def f(x):
    return x
g(5)

The call to g(5) invokes the earlier f, but the "recursive" call no longer recurses! At run-time, the name 'f' is rebound to the later non-recursive definition, so the returned value is 4, not 0. While I agree that this particular example is bad style, it is a well-defined part of Python's semantics that has plenty of legitimate uses, and a compiler that made this replacement in the optimistic hope that f's definition will remain unchanged would introduce enough bugs in real-world code to cause an outrage.

Another blog post showed decorators that can be used to implement tail recursion using magical exceptions or return values. These can be written in plain Python (that post also shows an optimized Cython version that is claimed to be "only 10% slower", though it doesn't seem to be thread-safe). If this tickles your fancy I won't try to stop you, but I would strongly object to the inclusion of something like this in the core distribution: there are many caveats to the use of such a decorator, since it has to assume that any recursive call (in the decorated function) is tail-recursive and can be eliminated. In the hands of less experienced users this could easily lead to disasters. For example, the common recursive definition of factorial is not tail-recursive:
def fact(n):
    if n > 1:
        return n * fact(n-1)
    return 1

There are also plenty of functions that contain a tail-recursive call and another recursive call that isn't tail-recursive; the decorators don't handle such cases. Another subtlety that those decorators don't handle is tail-recursive calls inside try blocks: these may look like they could be eliminated, but they can't, because TRE could remove the exception handling which is guaranteed by the language. For all these reasons I think the decorator approach is doomed, at least for a general audience.
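For concreteness, here is a minimal sketch of the exception-based variant (my own illustration with made-up names; not thread-safe, and note how it blindly assumes every recursive call is a tail call -- exactly the trap described above):

class _TailCall(Exception):
    def __init__(self, args, kwargs):
        self.call_args, self.call_kwargs = args, kwargs

def tail_recursive(func):
    def wrapper(*args, **kwargs):
        if wrapper._running:
            # A recursive call: unwind back to the loop below instead of
            # growing the stack -- destroying any pending work on the way!
            raise _TailCall(args, kwargs)
        wrapper._running = True
        try:
            while True:
                try:
                    return func(*args, **kwargs)
                except _TailCall as tc:
                    args, kwargs = tc.call_args, tc.call_kwargs
        finally:
            wrapper._running = False
    wrapper._running = False
    return wrapper

Decorate fact() above with this and fact(5) silently returns 1: the exception raised for the inner call wipes out the pending multiplication by n.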

Still, if someone was determined to add TRE to CPython, they could modify the compiler roughly as follows. First, determine "safe" tail-recursive call sites. This would be something like a CALL opcode immediately followed by a RETURN opcode, and completely outside any try blocks. (Note: I'm ignoring the different CALL_* opcodes, which should be easy enough to handle using the same approach.) Next, replace each such CALL-RETURN opcode pair with a single CALL_RETURN opcode. There's no need for the compiler to try and check if the name of the function being called is the same as the current function: the new opcode can represent savings for all CALL+RETURN combinations merely by saving the time to decode a second opcode. If at run time the determination is made that this particular call is not applicable for TRE, the usual actions for a CALL followed by a RETURN opcode are carried out. (I suppose you could add some kind of caching mechanism indexed by call site to speed up the run-time determination.)

In determining whether TRE can be applied, there are several levels of aggressiveness that you could apply. The least aggressive, "vanilla" approach would only optimize the call if the object being called is the function that is already running in the current stack frame. All we have to do at this point is clear the locals out of the current stack frame (and other hidden state like active loops), set the arguments from the evaluation stack, and jump to the top. (Subtlety: the new arguments are, by definition, in the current stack frame. It's probably just a matter of copying them first. More subtleties are caused by the presence of keyword arguments, variable length argument lists, and default argument values. It's all a simple matter of programming though.)

A more aggressive version would also recognize the situation where a method is tail recursive (i.e. the object being called is a bound method whose underlying function is the same as the one in the current stack frame). This just requires a bit more programming; the CPython interpreter code (ceval.c) already has an optimization for method calls. (I don't know how useful this would be though: I expect the tail recursive style to be popular with programmers who like to use a functional programming style overall, and they would probably not be using classes that much. :-)

In theory, you could even optimize all cases where the object being called is a function or method written in Python, as long as the number of local variables needed for the new call can be accommodated in the current stack frame object. (Frame objects in CPython are allocated on the heap and have a variable allocation size based on the required space for the locals; there is already machinery for reusing frame objects.) This would optimize mutually tail-recursive functions, which otherwise wouldn't be optimized. Alas, it would also disable stack traces in most cases, so it would probably not be a good idea.

A more benign variant would be to create Python-level stack frames objects just like before, but reuse the C stack frame. This would create an approximation of Stackless Python, though it would still be easy enough to run out of C stack by recursing through a built-in function or method.

Of course, none of this does anything to address my first three arguments. Is it really such a big deal to rewrite your function to use a loop? (After all TRE only addresses recursion that can easily be replaced by a loop. :-)
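For the record, here is what the two recursive examples above look like as loops (a sketch):

def f(x):
    while x > 0:
        x = x - 1
    return 0

def fact(n):
    result = 1
    while n > 1:
        result *= n
        n -= 1
    return result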

Tuesday, April 7, 2009

Italia Here I Come!

That is: PyCon Italia, here I come! It's still a month away (May 8-10), but I'm already looking forward to my vacation in the historic city of Florence and on the beautiful west coast of Italy. The organizers of PyCon Italia kindly invited me to give a keynote at their annual conference, making me an offer I couldn't refuse.

Kidding aside, it looks like it will be a very exciting conference, with Googlers Fredrik Lundh and Alex Martelli also coming to speak, as well as Python core developer Raymond Hettinger. And even though much of the program will be in Italian, real-time translations will be available for the main track. Personally, I'm most looking forward to a mysterious late-night event labeled PyBirra, where the locals will try to drink me under the table. Salute!

Friday, March 6, 2009

Capabilities for Python?

I received an email recently from Mark Miller, quoting a post from Zooko to the cap-talk mailing list (which I do not read). Mark asked me to clarify my position about capabilities (in Python, presumably). Since the last thing I need is another mailing list subscription, I'm posting my clarification here. I'm sure that through the magic of search engines it will find its way to the relevant places.

In his post, Zooko seems to believe that I am hostile to the very idea of capabilities, and seems to draw a link between this assumed attitude and my experience with the use of password-based capabilities in Amoeba. This is odd for several reasons. First, the way I remember it, Amoeba's capabilities weren't based on passwords, but on one-way functions and random numbers (and secure Ethernet wall-sockets, which is perhaps why the idea didn't catch on :-). Second, I don't believe my experience with capabilities in Amoeba made a difference in how I think about capabilities being offered by some modern programming languages like E, or about the various proposals over the years to add capabilities to Python, perhaps starting with an old proposal by Ka-Ping Yee and Ben Laurie. (It would be better to think of this as a subtraction rather than an addition, since such proposals invariably end up limiting the user to a substantially reduced subset of Python. More about that below.)

But the biggest surprise to me is that people are reading so much in my words. I'm not the Pope! I'm a hacker who likes to think aloud about design problems. Often enough I get it wrong. If you think you disagree with me, or have a question about what I said, just respond in the forum where I post (e.g. python-dev or python-ideas, or this blog), but please don't go forwarding my messages to lists I don't read and speculate about them.

With that off my mind, and with the caveat that this entire post is thinking aloud, let me try to lay out some of my current thoughts about capabilities and Python.

Note that I'm trying to limit myself to Python. Languages specifically written to support capabilities exist (e.g. E) and may well become successful, though I expect they will have a hard time gaining popularity until they also sprout some other highly attractive features: most developers see security as a necessary evil.

This attitude, of course, is the reason why the idea of adding security features to an existing language keeps coming back: it's assumed to be much more likely to convince the "unwashed masses" to switch to a slightly different version of a language they already know, than to get them to even try (let alone adopt) a wholly new language. This argument is not limited to security zealots, of course. The same reasoning is common in the larger world of "language marketing": C++ made compatibility with C a principle overruling all others, Java chose to resemble C or C++ for ease of adoption, and it is well known that Larry Wall picked many of Perl's syntactic quirks because the initial target audience was already using sed and sh.

I'll be the first to admit that I wasn't completely free of this attitude in Python's design, although I didn't do it with the intent of gaining popularity: whenever I borrowed from another language, I did so either because I recognized a good idea, or because I didn't think I had anything to add to current practice, but not because I was concerned about market share. (If I had been, I wouldn't have used indentation for grouping. :-)

Anyway, regardless of the merits of this idea, it keeps coming back. A recent incarnation is Mark Seaborn's CapPython. Skimming through this wiki page it seems that Mark is well aware of the limitations: the section labeled "problem areas" takes up more than half of the page. And the most recent discussion (which also triggered Zooko's post I believe) started with a blog post by Tav where he proposes (with my encouragement) some modest additions to CPython's existing restricted execution mode and challenges the world to break into it. In a follow-up post, Tav provides a better history of this topic than I could provide myself.

And yet, I remain extremely skeptical of this whole area. The various attacks on Tav's supervisor code show how incredibly subtle it is to write a secure supervisor. CPython's restricted execution model lets sandboxed (= untrusted) code call into the supervisor, where the supervisor's Python code runs with full permissions. In Tav's version, the sandbox is given access to the supervisor only through a small collection of function objects which the supervisor passes into the sandbox. Tav's proposed changes remove some introspection attributes from function and class objects that would otherwise give the sandboxed code access to data or functions that the supervisor is trying to hide from the sandbox. This basic idea works well and nobody has yet found a way to break out of the sandbox directly -- so far it looks like no other attributes need to be removed in order to secure the sandbox.

However, several attacks found non-obvious weaknesses in Tav's supervisor code itself: it is deceptively easy to trick the supervisor into calling seemingly safe built-in functions with arguments carefully crafted by the code inside the sandbox so as to make it reveal a secret. This uses an approach that was devised years ago by Samuele Pedroni to dispell doubt that restricted execution was unsafe in Python 2.2 and beyond.

Samuele's approach combines two properties of (C)Python: built-ins invoked by the supervisor run with the supervisor's permissions, and there are many places in Python where implicit conversions attempt to call various specially-named attributes on objects given to them. The sandboxed exploit defines a class with one of these "magic" attributes set to some built-in, and voila, the built-in is called with the supervisor's permissions. It takes some added cleverness to pass an interesting argument to the built-in and to get the result back, but it can be done: for details, see Tav's blog.

My worry about this approach is that a supervisor that provides a reasonably large subset of Python will have to implement some pretty complex functionality: for example, you'll have to support a secure way to import modules. My confidence in the security of the supervisor goes down exponentially as its complexity goes up. In other words, while Tav may be able to evolve the toy supervisor in "safelite.py" into an impenetrable bastion after enough iterations of exploit-and-patch, I don't think this approach will converge in a realistic timeframe (e.g. decades) for a more fully-featured supervisor.

This lets me segue into another, perhaps more generic, concern with the idea of providing a secure subset of Python, whether it's based on restricted execution, capabilities, or restricting attribute references (like CapPython and Zope's RestrictedPython). Python's claim to fame comes largely from its standard library. People's proficiency with the language is not just measured by how well they can construct efficient algorithm implementations using lists and dicts: to a large extent it depends on how much of the standard library they master as well. Python's standard library is large compared to many other languages. Only Java seems to have more stuff that's assumed to be "always there" (except in certain embedded environments).

For a "secure" version of Python to succeed, it will need to support most of the standard library APIs. I'm distinguishing between the implementations and APIs here, for it is likely that many standard library modules use features of the language that aren't available by the secure subset under consideration. This doesn't have to be a show-stopper as long as an alternate implementation can be provided that uses only the secure subset.

Unfortunately, I expect that, due to a combination of factors, it will be impractical to provide a sufficiently large subset of the standard library for a sufficiently secure subset of Python. One problem is that Python, being a highly dynamic language, supports introspection at many levels, including some implementation-specific ones, like access to bytecode in CPython, which has no equivalent in Jython, IronPython or other implementations. Because of the language's dynamic and introspective features, there is often no real distinction between a module's API and its implementation. While this is an occasional source of frustration for Python users (see e.g. the recent discussion about asyncore on python-dev), in most cases it works quite well, and often APIs can be simpler because of certain dynamic features of the language. For example, there are several ways that dynamic attribute lookup can enhance an API: automatic delegation is just one of the common patterns that it enables; command dispatch is another. All this leads me to think that a secure version of Python is unlikely to become complete enough to attract enough users to become viable. I'd be happy to be proven wrong, but it seems that the people most attracted to the idea are hoping that adding capabilities to Python will somehow provide a shortcut to success. Unfortunately, I don't think it's a shortcut at all.

I should mention that I have some experience in this area: Google's App Engine (to which I currently contribute most of my time) provides a "secure" variant of Python that supports a subset of the standard library. I'm putting "secure" in scare quotes here, because App Engine's security needs are a bit different than those typically proposed by the capability community: an entire Python application is a single security domain, and security is provided by successively harder barriers at the C/Python boundary, the user/kernel boundary, and the virtual machine boundary. There is no support for secure communication between mutually distrusting processes, and the supervisor is implemented in C++ (crucial parts of it live in a different process).

In the App Engine case, the dialect of the Python language supported is completely identical to that implemented by CPython. The only differences are at the library level: you cannot write to the filesystem, you cannot create sockets or pipes, you cannot create threads or processes, and certain built-in modules that would support backdoors have been disabled (in a few cases, only the insecure APIs of a module have been disabled, retaining some useful APIs that are deemed safe). All these are eminently reasonable constraints given the goal of App Engine. And yet almost every one of these restrictions has caused severe pain for some of our users.

Securing App Engine has required a significant use of internal resources, and yet the result is still quite limiting. Now consider that App Engine's security model is much simpler than that preferred by capability enthusiasts: it's an all-or-nothing model that pretty much only protects Google from being attacked by rogue developers (though it also helps to prevent developers from attacking each other). Extrapolating, I expect that a serious capability-based Python would require much more effort to secure, and yet would place many more constraints on developers. It would have to have a very attractive "killer feature" to make developers want to use it...

Thursday, January 29, 2009

Detecting Cycles in a Directed Graph

I needed an algorithm for detecting cycles in a directed graph. I came up with the following. It's probably something straight from a textbook, but I couldn't find a textbook that had one, so I came up with this myself. I like the simplicity. I also like that there's a well-defined point in the algorithm where you can do any additional processing on each node once you find that it is not part of a cycle.

The function makes few assumptions about the representation of the graph; instead of a graph object, it takes in two function arguments that are called to describe the graph:
  • def NODES(): an iterable returning all nodes
  • def EDGES(node): an iterable returning all nodes reached via node's outgoing edges
In addition it takes a third function argument which is called once for each node:
  • def READY(node): called when we know node is not part of any cycles
The function returns None upon success, or a list containing the members of the first cycle found otherwise. Here's the algorithm:
def find_cycle(NODES, EDGES, READY):
    todo = set(NODES())
    while todo:
        node = todo.pop()
        stack = [node]
        while stack:
            top = stack[-1]
            for node in EDGES(top):
                if node in stack:
                    return stack[stack.index(node):]
                if node in todo:
                    stack.append(node)
                    todo.remove(node)
                    break
            else:
                node = stack.pop()
                READY(node)
    return None
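To show how the three callbacks fit together, here is a quick usage sketch (my example: the graph is a dict mapping each node to a list of its successors, and B -> C -> B forms a cycle):

graph = {'A': ['B'], 'B': ['C'], 'C': ['B', 'D'], 'D': []}

acyclic = []
cycle = find_cycle(lambda: graph.keys(),
                   lambda node: graph[node],
                   acyclic.append)
# cycle is now ['B', 'C'] (or ['C', 'B'], depending on where the
# traversal happened to start); acyclic holds whichever nodes were
# already proven cycle-free before the cycle was found (e.g. ['D']).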
Discussion: The EDGES() function may be called multiple times for the same node, and the for loop does some duplicate work in that case. A straightforward fix for this inefficiency is to maintain a parallel stack of iterators that is pushed and popped at the same times as the main stack, and at all times contains an iterator over the edges of the corresponding node on the main stack. I'll leave that version as an exercise.

Update: Fixed a typo in the algorithm (EDGES(top)) and renamed all to todo.

Tuesday, January 13, 2009

The History of Python - Introduction

Python is 19 years old now. I started the design and implementation of the language on a cold Christmas break in Amsterdam, in late December 1989. It started out as a typical hobby project. Little did I know where it would all lead.

With Python's coming of age, I am going to look back on the history of the language, from its conception as a personal tool, through the early years of community building ("If Guido was hit by a bus?"), all the way through the release of Python 3000, almost 19 years later. It's been quite an adventure, for myself as well as for the users of the language.

This won't be an ordinary blog post -- it'll be an open-ended series. I may invite guest writers. I'll be touching upon many aspects of the language's history and evolution, both technical and social.

I'll start with the gradual publication of material I wrote a few years ago, when I was invited to contribute an article on Python to HOPL-III, the third installment of ACM's prestigious History of Programming Languages conference, held roughly every ten years. Unfortunately, the demands of the rather academically inclined reviewers were too much for my poor hacker's brain. Once I realized that with every round of review the amount of writing left to do seemed to increase rather than decrease, I withdrew my draft. Bless those who persevered, but I don't believe that the resulting collection of papers gives a representative overview of the developments in programming languages of the past decade.

The next destination of the draft was a book on Python to be published by Addison-Wesley. Again, the mountain of raw material that I had collected was too large and at the same time too incomplete to serve as a major section of the book, despite the editing help I received from David Beazley, a much better writer than I am.

As they tell prospective Ph.D. students, the best way to eat an elephant is one meal at a time. So today I am publishing the first bit of the elephant, perhaps still somewhat uncooked, but at least it's out there. Hopefully others who were there at the time can help clear up the inevitable omissions and mistakes. I have many more chapters, each still requiring some editing, and I expect this to be a long-running series. Therefore I am starting a separate blog title for this, unimaginatively called The History of Python. Follow the link and enjoy!