Wednesday, October 22, 2008

Sorting a million 32-bit integers in 2MB of RAM using Python

Someone jokingly asked me how I would sort a million 32-bit integers in 2 megabytes of RAM, using Python. Taking up the challenge, I learned something about buffered I/O.

Obviously this is a joke question -- the data alone would take up 4 megabytes, assuming binary encoding! But there's a possible interpretation: given a file containing a million 32-bit integers, how would you sort them with minimal memory usage? This would have to be some kind of merge sort, where small chunks of the data are sorted in memory and written to a temporary file, and then the temporary files are merged into the eventual output area.

Here is my solution. I'll annotate it in a minute.

NOTE: All my examples use Python 3.0. The main difference in this case is the use of file.buffer to access the binary stream underlying the text stream file.

#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4

def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
On my Google desktop (a 3-year-old PC running a Googlified Linux, rating about 34000 Python 3.0 pystones) this took about 5.4 seconds to run, with an input file containing exactly 1,000,000 32-bit random integers. That's not so bad, given that a straightforward in-memory sort of the same input took about 2 seconds:
#!/usr/bin/env python3.0
import sys, array
a = array.array('i', sys.stdin.buffer.read())
a = list(a)
a.sort()
a = array.array('i', a)
a.tofile(sys.stdout.buffer)
Back to the merge-sort solution. The first three lines are obvious:
#!/usr/bin/env python3.0
import sys, array, tempfile, heapq
assert array.array('i').itemsize == 4
The first line says we're using Python 3.0. The second line imports the modules we're going to need. The third line makes the program break on those 64-bit systems where the 'i' typecode doesn't represent a 32-bit int; I am making no attempt to write this code portably.
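(If you did care about portability, one option would be to pick a 4-byte typecode at runtime instead of asserting; this little sketch is mine, not part of the original program:)

```python
import array

# Pick whichever signed typecode has 4-byte items on this platform.
# 'i' is 4 bytes almost everywhere, but 'l' varies between platforms.
typecode = next(tc for tc in 'il' if array.array(tc).itemsize == 4)
print(array.array(typecode).itemsize)  # → 4
```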

Then we define a little helper that is a generator which reads integers from a file and yields them one at a time:
def intsfromfile(f):
    while True:
        a = array.array('i')
        a.fromstring(f.read(4000))
        if not a:
            break
        for x in a:
            yield x
This is where the performance tuning of the algorithm takes place: it reads up to 1000 integers at a time, and yields them one by one. I had originally written this without buffering -- it would just read 4 bytes from the file, convert them to an integer, and yield the result. But that ran about four times slower! Note that we can't use a.fromfile(f, 1000), because the fromfile() method complains bitterly when there aren't enough values in the file, and I want the code to adapt automatically to however many integers are in the file. (It turns out we write about 10,000 integers to a typical temp file.)
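You can see fromfile()'s complaint in a tiny experiment (my own sketch, not part of the program): when the file runs out it raises EOFError, although the items it did manage to read are still appended to the array.

```python
import array, tempfile

f = tempfile.TemporaryFile()
array.array('i', range(5)).tofile(f)   # write just 5 integers
f.seek(0)

a = array.array('i')
try:
    a.fromfile(f, 1000)   # ask for far more items than the file holds
except EOFError:
    pass                  # the 5 available items were still appended
print(len(a))  # → 5
```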

Next we have the input loop. This repeatedly reads a chunk of 10,000 integers from the input file, sorts them in memory, and writes them to a temporary file. We then add an iterator over that temporary file, using the above intsfromfile() function, to a list of iterators that we'll use in the subsequent merge phase.

iters = []
while True:
    a = array.array('i')
    a.fromstring(sys.stdin.buffer.read(40000))
    if not a:
        break
    f = tempfile.TemporaryFile()
    array.array('i', sorted(a)).tofile(f)
    f.seek(0)
    iters.append(intsfromfile(f))
Note that for an input containing a million values, this creates 100 temporary files each containing 10,000 values.

Finally we merge all these files (each of which is sorted) together. The heapq module has a really nice function for this purpose: heapq.merge(iter1, iter2, ...) returns an iterator that yields the input values in order, assuming each input itself yields its values in order (as is the case here).

a = array.array('i')
for x in heapq.merge(*iters):
    a.append(x)
    if len(a) >= 1000:
        a.tofile(sys.stdout.buffer)
        del a[:]
if a:
    a.tofile(sys.stdout.buffer)
This is another place where buffered I/O turned out to be essential: Writing each individual value to a file as soon as it is available slows the algorithm down about twofold. Thus, by simply adding input and output buffering, we gained a tenfold speed-up!

And that's the main moral of the story.

Another lesson is praise for the heapq module, which contains the iterator merge functionality needed in the output phase. Also, let's not forget the utility of the array module for managing binary data.
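In case you haven't met it, here's a tiny demonstration of that merge functionality:

```python
import heapq

# merge() lazily interleaves any number of already-sorted iterables,
# keeping only one "head" element per input in memory at a time.
print(list(heapq.merge([1, 4, 9], [2, 3, 10], [5, 6])))
# → [1, 2, 3, 4, 5, 6, 9, 10]
```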

And finally, let this example remind you that Python 3.0 is not so different from Python 2.5!

37 comments:

Paul Smith said...

The formatting for the code examples is really crappy: the indentation is wrong for some lines, and there's not enough of it for most others. Ironically, for Python, it is hard to read.

isagalaev said...

> let this example remind you that Python 3.0 is not so different from Python 2.5

Except that Python 2.5 doesn't have heapq.merge :-)

Tyler said...

Paul, I think that is just the way blogger handled the code.

The only thing I noticed is the imports are all on one line, which is against PEP 8.

I'd never heard of heapq... that's pretty sweet

PouleJapon said...

I really appreciated your post, especially the use of generators, and heapq.

Indeed, it would be awesome if there were some kind of plugin in Blogspot for writing formatted code.

David said...

Maybe the question was asked jokingly, but this is a very common task in database implementation. RDBMSs have always had to be able to sort tables on disk that are arbitrarily larger than RAM. What you've described is a nice two-phase multiway merge sort (efficiently and in very good Pythonic style, of course :), which I imagine is taught in most database systems classes.

Dave
(P.S. Hi Guido!)

rbonvall said...

You could use an embeddable pastebin (such as gist) and get syntax highlighting for free: http://gist.github.com/18896 . Beautiful python code deserves to be shown nicely :)

Guido van Rossum said...

I cleaned up the comments. Embarrassingly, they were a copy-paste error when I moved the text from my Artima plaintext edit buffer to Blogger. Thanks for pointing it out. No thanks for flaming.

Guido van Rossum said...

PS. Python 2.6 does have heapq.merge()...

Tyler said...

Guido, my post wasn't meant as a flame if that is what you are referring to. I was just messing around. Actually, I've been playing with python since 1.5(.4 I think... but I can't remember) and I had never seen PEP 8 until maybe a month ago, so it was fresh in my mind as I read your code.

I was poking fun, when, in reality I have a habit of just putting all my imports on one line also.

Guido van Rossum said...

@tyler: No, I was thinking of paul smith.

Christian Witts said...

Nice example of making the most out of a limited system.

Ralph said...

If this is an "interview" question, then another approach is to clarify how many distinct 32-bit values there are to sort. If sufficiently low, then track their frequency as each is read and discarded, and then output that many of them in sorted order.
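A sketch of that idea, using collections.Counter (which is newer than this post, so treat it as illustrative only):

```python
from collections import Counter

def frequency_sort(stream):
    # Memory is bounded by the number of distinct values, not the
    # stream length: each integer is counted and then discarded.
    counts = Counter(stream)
    for value in sorted(counts):
        for _ in range(counts[value]):
            yield value

print(list(frequency_sort([3, 1, 3, 2, 1, 3])))  # → [1, 1, 2, 3, 3, 3]
```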

Olivier said...

Christophe Meessen suggests an algorithm described here

http://news.ycombinator.com/item?id=341037

It requires only two passes over the million integers and uses nothing more than the 2MB buffer. No temporary files or the like.

The complexity is O(n log n).

Paolo Bonzini said...

Now, what if you have to sort a million unique integers between 0 and 10,000,000? :-)

ptn said...

@Paolo Bonzini: Unique? Makes me think of a bitmap...
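A sketch of the bitmap idea: one bit per candidate value is 10,000,000 / 8 = 1.25MB, within the 2MB budget, but it relies on the values being unique, non-negative, and bounded (the helper below is just my illustration):

```python
def bitmap_sort(values, limit=10000000):
    # One bit per candidate value: limit/8 bytes total.
    bits = bytearray((limit + 7) // 8)
    for v in values:
        bits[v >> 3] |= 1 << (v & 7)
    # A second pass over the bitmap emits the set values in order.
    for i in range(limit):
        if bits[i >> 3] & (1 << (i & 7)):
            yield i

print(list(bitmap_sort([5, 2, 9, 0], limit=16)))  # → [0, 2, 5, 9]
```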

Guido van Rossum said...

@olivier: that's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked.

union said...

For nice code listings you can use google's own client side code highlighting lib:

google-code-prettify

My post on how to set it up:

Code highlighting on blogger

Guido van Rossum said...

@union: I followed your recipe, and I verified that all the elements are indeed in the page, but it doesn't work. What am I missing?

Boris Bluntschli said...

I think if you change the first two lines to

<link href="http://google-code-prettify.googlecode.com/svn/trunk/src/prettify.css" rel="stylesheet" type="text/css"/>
<script src="http://google-code-prettify.googlecode.com/svn/trunk/src/prettify.js" type="text/javascript">

it should work. You're currently linking to the SVN repository browser, not the files themselves.

union said...

Yes, Boris is correct, I included the wrong URLs (but somehow managed to put the right ones in my template).

I have corrected my original post, or you can just use the tags Boris posted in his comment.

LetsEatLunch said...

Beastly...

Luis said...

Thanks Guido!

By the way, when are you replying the other questions? :-)

Joel Hockey said...

I had a go at implementing this using a different approach which runs much faster on my computer. I read the file multiple times and select values within a given range that will (hopefully) fit in the 1Mb memory limit. This is then sorted and appended to the output. I choose the ranges by partitioning the possible 32-bit values [-2^31, 2^31-1] into 8 sections.

The timing numbers I get are:
in-memory sort: 2.09s
guido merge sort: 10.90s
my select range sort: 2.92s

#!/usr/bin/env python3.0
import array

outf = open('joel-sorted.dat', 'wb')

INT_MIN, INT_MAX = -(2**31), 2**31-1
size = int(2**32 / 8)
cutoffs = range(INT_MIN, INT_MAX, size)

# read file through and only collect values in our range
for i in range(len(cutoffs)):
  inf = open('nums.dat', 'rb')
  min, max = cutoffs[i], cutoffs[i]+size
  a = array.array('i')
  while True:
    temp = array.array('i', inf.read(500000))
    if not temp:
      break
    for j in temp:
      if j >= min and j < max:
        a.append(j)

  # sort and append to output
  array.array('i', sorted(a)).tofile(outf)

Ralph said...

Hi Joel Hockey, splitting 2**32 into 8 gives a range of 2**29. All 1,000,000 integers, roughly 2**20, could easily fit into the first slice so you'd be sorting the whole file in one go; thereby exceeding the memory limitation. You'd need to split 2**32 into many more chunks than 8 to avoid exceeding memory, and that will probably slow you down a lot. Cheers, Ralph.

dominic said...

How about using Numpy to read them in with frombuffer()? Something like in the multiprocess test here.

Joel Hockey said...

Hi Ralph. Yes it is possible for this approach to exceed the 2M memory limit if the numbers are skewed. I should have mentioned that it makes the assumption that the numbers are uniformly random.

If the 1M 32-bit ints are sufficiently uniform though, then this approach uses about 1Mb of memory. The range of possible values for a 32-bit signed int is [-2147483648, 2147483648) (lower number is included, upper number is excluded). I break this range up into 8 smaller ranges - [-2147483648, -1610612736), [-1610612736, -1073741824), ... [1610612736, 2147483648). Out of 1 million uniformly random 32-bit numbers, you would expect 125,000 to fall into each range. In fact, as long as there are not more than 250,000 numbers in each sub-range, the approach will not exceed the memory limit.

I chose 8 sub-ranges to be conservative. You could choose 3 or 4, or even possibly 2, but then you would need to sort the numbers in-place and the code would have to be a lot trickier, which would likely make it slower. The performance for 8 was not much worse than 3, and not much different to a full in-memory sort. My calculation for the memory used is:
* 500,000 bytes holding input numbers which can be reclaimed before sub-range sorting.
* 500,000 (approx) bytes holding numbers in given sub-range (125,000 32-bit ints)
* 500,000 (approx) bytes to hold sorted numbers in sub-range.

Total mem: approx 1Mb

Like I said, this does assume that the numbers are uniformly random. I did consider doing a first pass of the numbers to work out the optimal bounds for the sub-ranges, but this crude approach was enough to satisfy my curiosity.

Hopefully Jon Bentley will show up soon and wow us all with some way to do it that is way faster and uses less memory.

Ted H said...

GvR - this is the post that put your new blog on my radar. Subscribe to feed, Check.

I am looking forward to reading many more posts - keep it up.

Also, thanks for Python.

jurjen said...

Hmm. Still I am not convinced that the disk buffers for these 100 files don't actually just contain all the numbers, so that you seem to be cheating a bit: it is a "sort in disk buffer space".
It would be really hard, I think, to actually force the program to use little memory.
Still, heapq.merge is cool!

Stoune said...

This is a task more for Donald Knuth.
I think Python's influence here is not so big.

Travis said...

For what it's worth, NumPy and memory-mapped arrays can solve this problem quite simply and quickly (how much memory is used depends on the OS):

import numpy as np

np.memmap("file_with_integers", dtype=np.int32).sort(kind='merge')

pranny said...

For a beginner like me, this code and post were surely educational. It gives me one more answer to "why Python".

Linn said...

private void SortFile(string unsortedFileName, string sortedFileName)
{
    // Read the file into an array of ints
    byte[] file = System.IO.File.ReadAllBytes(unsortedFileName);
    int[] numbers = new int[file.Length / 4];
    for (int i = 0; i < numbers.Length; i++)
    {
        numbers[i] = BitConverter.ToInt32(file, i * 4);
    }

    // Sort the array
    Array.Sort(numbers);

    // Write the results back to file
    byte[] tempArray = new byte[4];
    for (int j = 0; j < numbers.Length; j++)
    {
        tempArray = BitConverter.GetBytes(numbers[j]);
        file[j * 4] = tempArray[0];
        file[j * 4 + 1] = tempArray[1];
        file[j * 4 + 2] = tempArray[2];
        file[j * 4 + 3] = tempArray[3];
    }

    System.IO.File.WriteAllBytes(sortedFileName, file);
}

Colin said...

@Linn: By reading all 1000000 4-byte integers in at once, you're using at least 4000000 bytes of memory. But you're only given 2MB. Plus, it looks like you're using another 4000000 bytes in your "file" array.

FutureNerd said...

The question is not a joke! But since it seems like the kind of thing Microsoft would use in a job-interview puzzle (pointless obsessive byte-squeezing optimization!), I'll be a little cryptic:

With a main working array that's only 2,000,000 bytes (Python or any language will have some fixed overhead, I think it's fair not to count that), you can write the program so that the only I/O it does is to read the unsorted input file in one pass, then write the sorted output file in one pass. No temporary files, seeks, additional buffers, or virtual arrays are needed.

2e6 bytes is a small percentage more than is needed. Given a fixed percentage slack, the run time scales like n log n.

In Python you can avoid the memory overhead that Guido mentions, by using the array module.

Anand said...

http://anandtechblog.blogspot.com/2010/12/how-to-sort-4gb-files-google-question.html

Shriram Kunchanapalli said...

@Anand: What's the run time complexity of your solution?