Tuesday, May 17, 2016

The AnyStr type variable

I was drafting a blog post on how to add type annotations for the new __fspath__() protocol (PEP 519) when I realized that I should write a separate post about AnyStr. So here it is.

A simple function on strings

Let’s write a function that surrounds a string in parentheses. We’ll put it in a file named demo.py:

def parenthesize(s):
    return '(' + s + ')'

It works, too:

>>> from demo import parenthesize
>>> print(parenthesize('hola'))
(hola)

Of course, if you pass it something that’s not a string it will fail:

>>> parenthesize(42)
Traceback (most recent call last):
  File "demo.py", line 1, in <module>
  File "demo.py", line 2, in parenthesize
TypeError: Can't convert 'int' object to str implicitly

Adding type annotations

Using PEP 484 type annotations we can clarify our little function’s signature:

def parenthesize(s: str) -> str:
    return '(' + s + ')'

Nothing to it, right? Even if you’ve never heard of PEP 484 before, you can guess what this means. (Note that PEP 484 also says that the runtime behavior is unchanged. The calls I showed above will still have exactly the same effect, including the TypeError raised by parenthesize(42).)
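To make the "runtime behavior is unchanged" point concrete, here is a small sketch that assumes nothing beyond the standard interpreter. The helper name call_with_int is mine, purely for illustration. The annotations are stored on the function as plain metadata, and the TypeError still comes from the + operator, not from any check of the annotation.

```python
def parenthesize(s: str) -> str:
    return '(' + s + ')'

# Annotations are recorded as metadata on the function object...
print(parenthesize.__annotations__)

# ...but the interpreter never enforces them at call time.
def call_with_int():
    """Hypothetical helper: return the error raised by a mis-typed call."""
    try:
        parenthesize(42)  # a type checker would flag this line
    except TypeError as err:
        return err
    return None

error = call_with_int()
print(type(error).__name__)
```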

Polymorphic functions

Now suppose this is actually part of a networking app and we need to be able to parenthesize byte strings as well as text strings. Here’s how you’d implement that:

def parenthesize(s):
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError(f"That's not a string, it's a {type(s)}")  # See PEP 498

With a fancy word we call that a polymorphic function. How do you write a signature for such a function? For the answer we have to dive a little deeper into PEP 484. It defines a nifty operator named Union that lets us state that a type can be either this or that (or something else). In our case, it’s either str or bytes, so we can write it like this:

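from typing import Union

def parenthesize(s: Union[str, bytes]) -> Union[str, bytes]:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError(f"That's not a string, it's a {type(s)}")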

Now let’s write a little main program with a bug, to show off the type checker:

from demo import parenthesize

a = parenthesize('hello')
b = parenthesize(b'hola')
c = a + b  # <-- bug here
print(c)

When we try to run this, the two parenthesize() calls work fine (yay polymorphism!) but we get a TypeError on the line with the bug:

$ python3 main.py 
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    c = a + b  # <-- bug here
TypeError: Can't convert 'bytes' object to str implicitly

The reason should be pretty obvious: in Python 3 you can’t mix bytes and str objects. And when we type-check this program using mypy we indeed get a type error:

$ mypy main.py 
main.py:5: error: Unsupported operand types for + (likely involving Union)

Debugging the bug

So let’s try a program without a bug:

from demo import parenthesize

a = parenthesize('hello')
b = parenthesize('hola')
c = a + b  # <-- no bug here
print(c)

Run it and it works great:

$ python3 main.py
(hello)(hola)

So the type checker should be happy too, right?

$ mypy main.py
main.py:5: error: Unsupported operand types for + (likely involving Union)

Whoops! The same error. What happened? Of course, I set you up, so I can explain something about type checking.

The trouble with unions

The type checker takes the signature at face value: it infers the type Union[str, bytes] for every call to parenthesize(), regardless of what the arguments are. This is because, for most functions of even modest complexity, a type checker doesn’t understand enough about what’s going on in the function body, so it just has to believe the types in the signature (even though in this particular case it would probably be easy enough to do better).

In our test program the types of a and b are both inferred to be exactly what parenthesize() claims to return, i.e., both variables have the type Union[str, bytes]. The type checker then analyzes the expression a + b, and here it discovers a problem: if a is either str or bytes, and so is b, then the + operator may be invoked on any of these combinations of types: str + str, str + bytes, bytes + str, or bytes + bytes. But only the first and the last are valid! In Python 3, str + bytes and bytes + str are invalid operations.
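The four combinations can be checked mechanically. The following is only a sketch: the names samples, addable, and results are mine. It tries every pairing of a str and a bytes operand under Python 3 and records which ones the + operator accepts.

```python
import itertools

samples = {'str': 'x', 'bytes': b'x'}

def addable(left, right):
    """Return True if left + right succeeds, False if it raises TypeError."""
    try:
        left + right
    except TypeError:
        return False
    return True

# Try all four (left type, right type) pairings.
results = {
    (lname, rname): addable(samples[lname], samples[rname])
    for lname, rname in itertools.product(samples, repeat=2)
}

for (lname, rname), ok in results.items():
    print(f"{lname} + {rname}: {'valid' if ok else 'TypeError'}")
```

Only the two same-type pairings succeed, and a checker cannot rule out the other two when both operands are typed as a Union.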

Aside: Even in Python 2, those two are suspect: while 'x' + u'y' indeed works (returning u'xy'), other combinations will raise UnicodeDecodeError, e.g.:

>>> 'Franç' + u'ois'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

Anyway, the type checker doesn’t like this business, and it rejects operations on Unions where some combinations are invalid. What can we do instead?

Function overloading

One option would be function overloading. PEP 484 defines a magical decorator, @overload, which lets us get around this problem. We could write something like this:

from typing import overload

@overload
def parenthesize(s: str) -> str: ...
@overload
def parenthesize(s: bytes) -> bytes: ...

This tells the type checker that if the argument is a str, the return value is also a str, and similarly for bytes. Unfortunately, at the time of writing, @overload is only allowed in stub files, which are a kind of interface definition file that shows a type checker the signatures of a module’s contents without giving the implementation.
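For readers on newer tooling: later revisions of PEP 484 and the typing module also accept @overload in regular modules, provided the overloaded signatures are followed by exactly one undecorated definition that holds the implementation. A minimal sketch, assuming Python 3.6 or later and a reasonably recent type checker:

```python
from typing import Union, overload

@overload
def parenthesize(s: str) -> str: ...
@overload
def parenthesize(s: bytes) -> bytes: ...

def parenthesize(s: Union[str, bytes]) -> Union[str, bytes]:
    # The single non-overloaded definition holds the real implementation;
    # callers are checked against the two @overload signatures above.
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError(f"That's not a string, it's a {type(s)}")

print(parenthesize('hello') + parenthesize('hola'))
```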

Type variables

Fortunately there’s an even better way, using type variables. This is how it goes:

from typing import TypeVar

S = TypeVar('S')

def parenthesize(s: S) -> S:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError("That's not a string, dude! It's a %s" % type(s))

Well… Almost. Our main.py program (unchanged from above) now gets a clean bill of health, but when we type-check this version of demo.py we get errors on both return lines:

demo.py: note: In function "parenthesize":
demo.py:7: error: Incompatible return value type: expected S`-1, got builtins.str
demo.py:9: error: Incompatible return value type: expected S`-1, got builtins.bytes

This is a bit hard to fathom, but the fix is what I was leading up to anyway, so I’ll reveal it now:

from typing import TypeVar

S = TypeVar('S', str, bytes)

def parenthesize(s: S) -> S:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError("That's not a string, dude! It's a %s" % type(s))

The only changed line is this one:

S = TypeVar('S', str, bytes)

This notation is called a type variable with value restriction. Yes, it’s a mouthful; we sometimes also call it a constrained type variable. S is a type variable restricted to a set of types. It also has the advantage of telling the type checker that types other than str or bytes are not acceptable. Without that, a call like this would have been considered valid:

x = parenthesize(42)

because the original type variable (without the restrictions) doesn't tell mypy that this is a bad idea.
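The value restriction is also visible at runtime. In this small sketch (the name Unrestricted is mine), each TypeVar records its allowed types in its __constraints__ attribute, mirroring what the type checker reads from the source:

```python
from typing import TypeVar

Unrestricted = TypeVar('Unrestricted')   # accepts any type
S = TypeVar('S', str, bytes)             # restricted to str or bytes

print(Unrestricted.__constraints__)  # empty: nothing is ruled out
print(S.__constraints__)             # the only types S may stand for
```

Note that these attributes are informational only. Rejecting parenthesize(42) is still the checker's job, not the interpreter's.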

In fact, this particular use case (a type variable constrained to str or bytes) is so commonly needed that it's predefined in the typing module, and all we have to do is import it:

from typing import AnyStr

def parenthesize(s: AnyStr) -> AnyStr:
    # Etc. -- trust me, it works!
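For completeness, here is the elided version spelled out, together with the "no bug" main program from earlier folded into the same file. It is a sketch that reuses the function body from the TypeVar example unchanged. The comments describe what a checker such as mypy is expected to infer.

```python
from typing import AnyStr

def parenthesize(s: AnyStr) -> AnyStr:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError("That's not a string, dude! It's a %s" % type(s))

a = parenthesize('hello')   # str in, so str out
b = parenthesize('hola')    # str in, so str out
c = a + b                   # str + str: no Union involved any more
print(c)

d = parenthesize(b'hola')   # bytes in, so bytes out
print(d)
```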

Real-world use of AnyStr

In fact, this is how many polymorphic functions in the os and os.path modules are defined. For example, in the stub for os.py we find definitions like the following:

def link(src: AnyStr, link_name: AnyStr) -> None: ...

and also this:

def split(path: AnyStr) -> Tuple[AnyStr, AnyStr]: ...

These show us a bit more of the power of type variables: the signature for link() indicates that either both arguments must be str or both must be bytes; split() demonstrates that the type variable may also occur in more complex constructs: splitting a str returns a tuple of two str objects, while splitting bytes returns a tuple of two bytes objects.
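The runtime behavior of these functions matches their signatures. A short sketch using os.path.join() and os.path.split() (the variable names are mine): str arguments produce str results, bytes arguments produce bytes results, and mixing the two is rejected.

```python
import os.path

# str in, tuple of str out.
head, tail = os.path.split(os.path.join('usr', 'lib'))
print(type(head).__name__, type(tail).__name__)

# bytes in, tuple of bytes out.
bhead, btail = os.path.split(os.path.join(b'usr', b'lib'))
print(type(bhead).__name__, type(btail).__name__)

# Mixing str and bytes fails at runtime; a type checker reports it earlier,
# because both arguments must bind to the same value of AnyStr.
try:
    os.path.join('usr', b'lib')
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True
print(mixed_ok)
```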

That’s all I wanted to share about AnyStr. Thanks for comments on the draft to Stephen Turnbull, Koos Zevenhoven, Ethan Furman, and Brett Cannon.

5 comments:

Lilah Avidan said...

A very clear explanation, thanks!

BTW, given the large amount of existing 2.7-based projects that could benefit from gradually adding type annotation, it might be appropriate to show the comment-based syntax in one or two of the examples in this and similar posts.

Guido van Rossum said...

Here's a primer on type annotations for Python 2: http://mypy.readthedocs.io/en/latest/python2.html. Also see the section in PEP 484: https://www.python.org/dev/peps/pep-0484/#suggested-syntax-for-python-2-7-and-straddling-code

Hobson said...

Wish there'd been a way to reuse existing syntax for a Union of types, like the tuple of types used in `except` and `isinstance`. `Union[str, bytes]` seems less obvious and lengthier than `(str, bytes)` or even `set(str, bytes)`.

Guido van Rossum said...

Hobson: Thank you for the question. I have now answered it at some length here: http://neopythonic.blogspot.com/2016/05/union-syntax.html

Haoxun Zhan said...

I've hacked something like `isinstance(obj, Union[int, float])`: https://github.com/huntzhan/magic-constraints