Thursday, January 29, 2009

Detecting Cycles in a Directed Graph

I needed an algorithm for detecting cycles in a directed graph. I came up with the following. It's probably something straight from a textbook, but I couldn't find a textbook that had one, so I came up with this myself. I like the simplicity. I also like that there's a well-defined point in the algorithm where you can do any additional processing on each node for which you find that is not part of a cycle.

The function makes few assumptions about the representation of the graph; instead of a graph object, it takes in two function arguments that are called to describe the graph:
  • def NODES(): an iterable returning all nodes
  • def EDGES(node): an iterable returning all nodes reached via node's outgoing edges
In addition it takes a third function argument which is called once for each node:
  • def READY(node): called when we know node is not part of any cycles
The function returns None upon success, or a list containing the members of the first cycle found otherwise. Here's the algorithm:
def find_cycle(NODES, EDGES, READY):
todo = set(NODES())
while todo:
node = todo.pop()
stack = [node]
while stack:
top = stack[-1]
for node in EDGES(top):
if node in stack:
return stack[stack.index(node):]
if node in todo:
stack.append(node)
todo.remove(node)
break
else:
node = stack.pop()
READY(node)
return None
Discussion: The EDGES() function may be called multiple times for the same node, and the for loop does some duplicate work in that case. A straightforward fix for this inefficiency is to maintain a parallel stack of iterators that is pushed and popped at the same times at the main stack, and at all times contains iter(node). I'll leave that version as an excercise.

Update: Fixed a typo in the algorithm (EDGES(top)) and renamed all to todo.

26 comments:

Johannes said...

I think another potential inefficiency are the linear searches for what's on the stack. A fix might be to maintain a map nodes_by_index, which map from the nodes on the stack to their indices... Still, I like your version, it's very readable.

mlk said...

That's very nice. I think it would make it even more easy to read and understand if `all` was renamed to `unchecked`.

Paul said...

If you need to do this for large graphs, Tarjan's algorithm can do it in linear time.

Here is a Python implementation.

Guido van Rossum said...

@Paul: I believe my algorithm can be linear time too (in something like the number of nodes+edges) with the iter-stack optimization and e.g. a set parallel to the stack to be able to test "node in stack" in constant time (what @Johannes pointed out).

That's assuming stack pushes and pops and hash table lookups are linear time, which IMO they are or can be made to be with enough trickery (e.g. pre-allocating a fixed stack).

@mlk: good point. Or 'todo' (I like short names :-).

Rafe Kaplan said...

You could also make the function in to an iterable so that it can find all cycles. Someone who wants just one could just call it and get the first element.

Paul said...

Ah, true.

Tarjan's algorithm can find *all* the cycles in a directed graph (or rather, all the strongly connected components, which includes things more complicated than cycles), with the same worst case complexity as detecting a single cycle, (which, now that I read your post more carefully, is what you are doing here). It's interesting that it's possible to do this fast.

By the way, "for node in EDGES(node):" should probably read "for node in EDGES(top):".

cdtgoogle said...

http://grapy.googlecode.com has a strongly connected components function you can use for this as well.

Alexandre said...

@Paul: I agree with you, the top variable is not used, make only sense if it's:
"for node in EDGES(top):"

lorg said...

I would recommend checking out the networkx graph library.

Apart from being quite mature, it can also find strongly connected parts, which seems to be what you're interested in.

rjprins said...

This is not a linear algorithm because statements like "if node in stack" and "if node in all" are actually nested loops.

Here is another way to find cycles:

- count the number of incoming edges for each node. O(n+e)
- put each node with 0 incoming edges in a set S. O(n)
- while S has nodes:
- take a node n from S.
- delete n and update the number of incoming edges of the nodes connected to n. if any of those nodes now have 0 incoming edges, put them in S.
- endwhile O(n+e)
- if S is empty put there are still nodes left, you have cycle.

Guido van Rossum said...

@rjprins: True, "if node in stack" is a search loop (but could easily be replaced by a hash table lookup as I said above). However "if node in todo" is a set membership test, which is a hash table lookup, not an implied loop.

To everyone recommending existing graphing libraries: I'm actually not interested in the cycles, I'm interesting in proving their absence, and then doing a bottom-up algorithm on the DAG. The graphs I am thinking of are not all that large; I'm expecting thousands of nodes but not millions (if you must know, it represents a program, not a social network :-).

subtenante said...

I had this problem way back in schooldays, I resolved it (in Fortran...) like this :

Another way to see the problem is through node activation, which results in matrix/vector multiplication.

"Activate" each node.

Iterate :
On each iteration, each node will activate its descendents and desactivate itself, and keep being activated only if one of its predecessors do activate it (see it as a flow of information that nodes push into their children).

After n iterations, if any node is still activated, there must be a cycle (or else all of the initial activations on the nodes would have extinguished).

To achieve it, the mathematical operation is matrix multiplication.

See the graph as a matrix of 0's where i does not point to j, and 1 where it does.

Multiply your matrix by a vector of 1's, get the result and iterate : multiply n times (n is the number of nodes) your matrx by the last result. If the sum of your last vector is greater than 1, you have a graph.

I didn't count the brute version of this algorythm might not be linear at all, rather in the O(n*n). Optimizations can be done (iterate only while you still have 1's, etc.).

lorg said...

@Guido:
1. Even when your problem is more limited in scope, I find using existing libraries is best. (Especially since finding strongly connected parts is O(V+E) (DFS twice), and any algorithm to prove that there aren't any cycles could not be better in this regard.)

2. Your programs don't have loops? Care to elaborate, or is this a future blog post? :)

Guido van Rossum said...

@lorg: (a) It's not so much the limited scope of my problem but the fact that the algorithm is really only a small part of much more complex code I am writing anyway. (It's less than 20 lines!)

(b) Loops are runtime cycles, not compile-time cycles. My edges are dependencies between variable and function definitions, not between values. Eventually I will have to deal with recursive function definitions, but then I'll somehow have to break those cycles since I don't want to deal with cyclical definitions.

andrew cooke said...

your code is similar to something i used just the other day - it's basically your routine (i assume - i've not read it in detail, but assume it's dfs checking for duplicates) turned into a generator for edges: http://www.ics.uci.edu/~eppstein/PADS/DFS.py

in your case i think you'd just filter for the nontree nodes.

Samuel said...

I'm going to be off topic for which I apologize, but the bit of code here reminded me of something I was curious about earlier.

Is the 2 space indentation part of an internal Google coding guideline? I first noticed it in the app engine SDK. It seems all Python code coming out of Google shares this trait.

Just curious.. and sorry to get off topic. :D

Thank you for starting the history of Python blog. I find it fascinating and look forward to reading the next post.

Ralph said...

Why the uppercased NODES, EDGES, and READY? Don't want to be a hobgoblin, and I'm not, I prefer `my' to `self', but PEP 8 exists. :-)

Guido van Rossum said...

> Why the uppercased NODES, EDGES, and READY?

I was trying to emphasize the algorithm. These three parametrize the algorithm, and I felt the need to call them out. Any realistic implementation would not use functions but expand these in-line according to the needs of the application. So you can think of them as template parameters or macros. :-)

Abrehet said...

It is interesting discussion. I am also looking for an algorithm to find cycles + loop from directed graph possibly with linear complexity. This discussion gives me some idea. I will post which method I choose later
Abrehet

misty said...

sorry.what we want is not present here.we want the efficient algorithm to find out loop in a directed graph.

Unknown said...

What is the definition of a node that is READY? Playing with some cycles, it isn't quite obvious to me.

# aren't d and e not part of a cycle?
graph = { 'a': ['b', 'c'],
'b': ['c'],
'c': ['a'], # cycle
'd': ['e',],
'e': [],
}

def ready(node):
print "node %s is not in a cycle" % node

def nodes():
""" Return all of the nodes """
return graph.keys()

def edges(node):
""" Return the children of node """
return graph[node]

cycles = find_cycle(nodes, edges, ready)

print "cycles: %s" % cycles

$ ./cycle.py
cycles: ['a', 'b', 'c']

Ralph Corderoy said...

I don't think Guido's definition is that it will call READY() on all nodes that aren't in a cycle before returning a cycle that exists, or None. From a quick look, whether some get passed to READY() before the first cycle is spotted depends. Break you cycle and then see if READY() is being called.

Guido van Rossum said...

I just got an email fro Peter Norvig pointing out that the efficiency of the algorithm can be fixed by using an OrderedDict. Nice!

Peter Norvig said...

Here's my version using OrderedDict -- O(1) membership, and no ned to re-generate links.

def find_cycle(nodes, links):
"""Given a set of nodes, and a function links(node) => [child, ...], return
a path of nodes that forms a cycle, or None if there is no cycle. Example:
>>> find_cycle(range(10), lambda x: [x+1, (x+2)%10])
[9, 1, 3, 5, 7, 9]
"""
# Adapted from a blog post by Guido:
# http://neopythonic.blogspot.com/2009/01/detecting-cycles-in-directed-graph.html
# Maintain path as a a depth-first path in the implicit tree;
# path is represented as an OrderedDict of {node: [child,...]} pairs.
while nodes:
root = nodes.pop()
path = OrderedDict({root: links(root)})
while path:
# Consider the children at the fringe of the tree
children = path[next(reversed(path))]
while children:
child = children.pop()
if child in path: ## Cycle found
pathlist = list(path)
return pathlist[pathlist.index(child):] + [child]
if child in nodes: ## New child not seen before
path[child] = links(child)
nodes.remove(child)
break
else: ## no more children; pop back up a level
path.popitem()
return None

Guido van Rossum said...

Here's Peter's version: https://www.dropbox.com/s/s0hc6jr47hmoitf/find_cycle_norvig.py (He posted it here but the indentation was all messed up.)

feffe81 said...

Some literatur:

K.A.HawickandH.A.James,
EnumeratingCircuitsandLoopsinGraphswithSelf-ArcsandMultiple-Arcs