Wednesday, May 18, 2016

Union syntax

Union syntax

(I'm trying to do this as a quick post in response to some questions I received on this topic. I realize this will probably reopen the whole discussion about the best syntax for types, but sorry folks, PEP 484 was accepted nearly a year ago, after many months of discussions and hundreds of messages. It's unlikely that any idea you can think of here would be new. This post just explains the rationale of one particular decision and tries to put it in some context.)
I've heard some grumbling about the union syntax in PEP 484: Union[X, Y, Z] (where X, Y and Z are arbitrary type expressions). In the past people have suggested X|Y|Z for this, or (X, Y, Z) or {X, Y, Z}. Why did we go with the admittedly clunkier Union[X, Y, Z]?

First of all, despite all the attention drawn to it, unions are actually a pretty minor feature, and you shouldn't be using them much. So you also shouldn't care that much.

Why not X|Y|Z?

This won't fly because we want compatibility with versions of Python 3 that were already frozen (see below). We want to be able to express e.g. a union of int and str, which under this notation would be written as int|str. But for that to fly we'd have to modify the builtin 'type' class to implement __or__ -- and that wouldn't fly on already-frozen Python versions. Supporting X|Y only for types (like List) imported from the typing module and some other notation for builtin types would only sow confusion. So X|Y|Z is out.

Why not {X, Y, Z}?

That's the set with elements X, Y and Z, using the builtin set notation. We can usefully consider types to be sets of values, and this makes a union a set of values too (that's why it's called union :-).

However, {X, Y, Z} confuses the set of types with the set of values, which I consider a mortal sin. This would just cause endless confusion.

This notation would also confuse things when taking the union of several classes that overlap, e.g. if we have classes B and C, where C inherits from B, then the union of B and C is just B. But the builtin set doesn't see it that way. In contrast, the X|Y notation could actually solve this (since in principle we could overload __or__ to do whatever we want), and the Union[] operator ("functor"?) from PEP 484 indeed solves this -- in this example Union[B, C] returns the (non-union) type B, both in the type checker and at runtime.

Why not (X, Y, Z)?

That's the tuple (X, Y, Z). It has the same disadvantages as {X, Y, Z}, but at least it has the advantage of being similar to how unions are expressed as arguments to isinstance(), for example isinstance(x, (int, str, list)) or isinstance(x, (Sequence, Mapping)). (Similarly the except clause: try: ... / except (KeyError, IndexError): ...)

Another problem with tuples is that the tuple syntax is already overloaded in so many ways that it would be confused with other uses even more easily. One particular confusion would be other generic types, for which we'd still want to use square brackets. (You can't really beat Iterable[int] for clarity if you have an iterable of integers. :-) Suppose you have a sequence of values that could be integers or strings. In PEP 484 notation we write this as Sequence[Union[int, str]]. Using the tuple notation we'd want to write this as Sequence[(int, str)]. But it turns out that the __getitem__ overload on the metaclass can't tell the difference between Sequence[(int, str)] and Sequence[int, str] -- and we would like to reject the latter as a mistake since Sequence[] is a generic class over a single parameter. (An example of a generic class over two parameters would be Mapping[K, V].) Disambiguating all this would place us on very thin ice indeed.

The nail in this idea's coffin is the competing idea of using (X, Y, Z) to indicate a tuple with three items, with respective types, X, Y and Z. At first sight this seems an even better use of the tuple syntax than unions would be, and tuples are way more common than unions. But it runs afoul of the same problems with Foo[(X, Y)] vs. Foo[X, Y]. (Also, there would be no easy way to describe what PEP 484 calls Tuple[X, ...], i.e. a variable-length tuple with uniform item type X.)

PS. Why support old Python 3 versions?

The reason for supporting older versions is adoption. Only a relatively small crowd of early adopters can upgrade to the latest Python version as soon as it's out; the rest of us are stuck on older versions (even Python 2.7!).

So for PEP 484 and the typing module, we wanted to support 3.2 and up -- we chose 3.2 because it's the newest Python 3 supported by some older but still popular Ubuntu and Debian distributions. (Also, 3.0 and 3.1 were too immature at their time of release to ever have a large following.)

There's a typing package that you can install easily using pip, and this defines all sorts of useful things for typing, from Any and Union to generic versions of List and Sequence. But such a package can't modify existing builtins like int or list.

(Eventually we also added Python 2.7 support, using type comments for function signatures.)

Adding type annotations for fspath

Type annotations for fspath

Python 3.6 will have a new dunder protocol, __fspath__() , which should be supported by classes that represent filesystem paths. Example of such classes are the pathlib.Path family and os.DirEntry  (returned by os.scandir() ).

You can read more about this protocol in the brand new PEP 519. In this blog post I’m going to discuss how we would add type annotations for these additions to the standard library.

I’m making frequent use of AnyStr , a quite magical type variable predefined in the typing module. If you’re not familiar with it, I recommend reading my blog post about AnyStr . You may also want to read up on generics in PEP 484 (or read mypy’s docs on the subject).

Adding os.scandir() to the stubs for os.py

For practice, let’s see if we can add something to the stub file for os.py. As of this writing there’s no typeshed information for os.scandir() , which I think is a shame. I think the following will do nicely. Note how we only define DirEntry  and scandir() for Python versions >= 3.5. (Mypy doesn’t support this yet, but it will soon, and the example here still works — it just doesn’t realize scandir()  is only available in Python 3.5.) This could be added to the end of stdlib/3/os/__init__.pyi:

from typing import Generic, AnyStr, overload, Iterator

if sys.version_info >= (3, 5):

    class DirEntry(Generic[AnyStr]):
        name = ...  # type: AnyStr
        path = ...  # type: AnyStr
        def inode(self) -> int: ...
        def is_dir(self, *, follow_symlinks: bool = ...) -> bool: ...
        def is_file(self, *, follow_symlinks: bool = ...) -> bool: ...
        def is_symlink(self) -> bool: ...
        def stat(self, *, follow_symlinks: bool = ...) -> stat_result: ...

    @overload
    def scandir() -> Iterator[DirEntry[str]]: ...
    @overload
    def scandir(path: AnyStr) -> Iterator[DirEntry[AnyStr]]: ...

Deconstructing this a bit, we see a generic class (that’s what the Generic[AnyStr]  base class means) and an overloaded function.  The scandir() definition uses @overload because it can also be called without arguments. We could also write it as follows; it’ll work either way:

    @overload
    def scandir(path: str = ...) -> Iterator[DirEntry[str]]: ...
    @overload
    def scandir(path: bytes) -> Iterator[DirEntry[bytes]]: ...

Either way there really are three ways to call scandir() , all three returning an iterable of DirEntry objects:

  • scandir() -> Iterator[DirEntry[str]] 
  • scandir(str) -> Iterator[DirEntry[str]] 
  • scandir(bytes) -> Iterator[DirEntry[bytes]] 

Adding os.fspath()

Next I’ll show how to add os.fspath() and how to add support for the __fspath__()  protocol to DirEntry .

PEP 519 defines a simple ABC (abstract base class), PathLike , with one method, __fspath__() . We need to add this to the stub for os.py , as follows:

class PathLike(Generic[AnyStr]):
    @abstractmethod
    def __fspath__(self) -> AnyStr: ...

That’s really all there is to it (except for the sys.version_info  check, which I’ll leave out here since it doesn’t really work yet). Next we define os.fspath() , which wraps this protocol. It’s slightly more complicated than just calling its argument’s __fspath__()  method, because it also handles strings and bytes. So here it is:

@overload
def fspath(path: PathLike[AnyStr]) -> AnyStr: ...
@overload
def fspath(path: AnyStr) -> AnyStr: ...

Easy enough! Next is update the definition of DirEntry . That’s easy too — in fact we only need to make it inherit from PathLike[AnyStr] , the rest is the same as the definition I gave above:

class DirEntry(PathLike[AnyStr], Generic[AnyStr]):
    # Everything else unchanged!

The only slightly complicated bit here is the extra base class Generic[AnyStr] . This seems redundant, and in fact PEP 484 says we can leave it off, but mypy doesn’t support that yet, and it’s quite harmless — this just rubs into mypy’s face that this is a generic class of one type variable (the by-now famous AnyStr ).

Finally we need to make a similar change to the stub for pathlib.py . Again, all we need to do is to make PurePath  inherit from PathLike[str] , like so:

from os import PathLike

class PurePath(PathLike[str]):
    # Everything else unchanged!

However, here we don’t add Generic , because this is not a generic class! It inherits from PathLike[str] , which is quite un-generic, since it’s PathLike specialized for just str .

Note that we don’t actually have to define the __fspath__()  method in these stubs — we’re not supposed to call them directly, and stubs don’t provide implementations, only interfaces.

Putting it all together, we see that it’s quite elegant:

for a in os.scandir('.'):
    b = os.fspath(a)
    # Here, the typechecker will know that the type of b is str!

The derivation that b has type str  is not too complicated: first, os.scandir('.')  has a str  argument, so it returns an iterator of DirEntry  objects parameterized with str , which we write as DirEntry[str] . Passing this DirEntry[str]  to os.fspath()  then takes the first of that function’s two overloads (the one with PathLike[AnyStr] ), since it doesn’t match the second one ( DirEntry  doesn’t inherit from AnyStr , because it’s neither a str  nor bytes ). Further the AnyStr type variable in PathLike[AnyStr] is solved to stand for just str , because DirEntry[str]  inherits from PathLike[str] . This is the specialized version of what the code says: DirEntry[AnyStr]  inherits from PathLike[AnyStr] .

Okay, so maybe that last paragraph was intermediate or advanced. And maybe it could be expanded. Maybe I’ll write another blog about how type inference works, but there’s a lot on that topic, and other authors have probably already written better introductory material about generics (in other languages, though).

Making things accept PathLike

There’s a bit of cleanup work that I’ve left out. PEP 519 says that many stdlib functions that currently take strings for pathnames will be modified to also accept PathLike . For example, here’s how the signatures for os.scandir()  would change:

@overload
def scandir() -> Iterator[DirEntry[str]]: ...
@overload
def scandir(path: AnyStr) -> Iterator[DirEntry[AnyStr]]: ...
@overload
def scandir(path: PathLike[AnyStr]) -> Iterator[DirEntry[AnyStr]]: ...

The first two entries are unchanged; I’ve just added a third overload. (Note that the alternative way of defining scandir() would require more changes — an indication that this way is more natural.)

I also tried doing this with a union:

@overload
def scandir() -> Iterator[DirEntry[str]]: ...
@overload
def scandir(path: Union[AnyStr, PathLike[AnyStr]]) -> Iterator[DirEntry[AnyStr]]: ...

But I couldn’t get this to work, so the extra overload is probably the best we can do. Quite a few functions will require a similar treatment, sometimes introducing overloading where none exists today (but that shouldn’t hurt anything).

A note about pathlib : since it only deals with strings, its methods (the ones that PEP 519 says should be changed anyway) should use PathLike[str]  rather than PathLike[AnyStr] .

Acknowledgments

(Thanks for comments on the draft to Stephen Turnbull, Koos Zevenhoven, Ethan Furman, and Brett Cannon.)

Tuesday, May 17, 2016

The AnyStr type variable

The AnyStr type variable

I was drafting a blog post on how to add type annotations for the new __fspath__()  protocol (PEP 519) when I realized that I should write a separate post about AnyStr . So here it is.

A simple function on strings

Let’s write a function that surrounds a string in parentheses. We’ll put it in a file named demo.py :

def parenthesize(s):
    return '(' + s + ')'

It works, too:

>>> from demo import parenthesize
>>> print(parenthesize('hola'))
(hola)

Of course, if you pass it something that’s not a string it will fail:

>>> parenthesize(42)
Traceback (most recent call last):
  File "demo.py", line 1, in
  File "demo.py", line 2, in parenthesize
TypeError: Can't convert 'int' object to str implicitly

Adding type annotations

Using PEP 484 type annotations we can clarify our little function’s signature:

def parenthesize(s: str) -> str:
    return '(' + s + ')'

Nothing to it, right? Even if you’ve never heard of PEP 484 before you can guess what this means. (Note that PEP 484 also says that the runtime behavior is unchanged. The calls I showed above will still have exactly the same effect, including the TypeError raised by parenthesize(42) .)

Polymorphic functions

Now suppose this is actually part of a networking app and we need to be able to parenthesize byte strings as well as text strings. Here’s how you’d implement that:

def parenthesize(s):
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError(f"That's not a string, it's a {type(s)}")  # See PEP 498

With a fancy word we call that a polymorphic function. How do you write a signature for such a function? For the answer we have to dive a little deeper into PEP 484. It defines a nifty operator named Union  that lets us state that a type can be either this or that (or something else). In our case, it’s either str  or bytes , so we can write it like this:

from typing import Union

def parenthesize(s: Union[str, bytes]) -> Union[str, bytes]:
    if isinstance(s, str):
    # Etc.

Now let’s write a little main program with a bug, to show off the type checker:

from demo import parenthesize

a = parenthesize('hello')
b = parenthesize(b'hola')
c = a + b  ### bug here<-- bug="" span="">
print(c)

When we try to run this, the two parenthesize()  calls work fine (yay polymorphism!) but we get a TypeError on the last line:

$ python3 main.py 
Traceback (most recent call last):
  File "main.py", line 5, in
    c = a + b  ### bug here<-- bug="" span="">
TypeError: Can't convert 'bytes' object to str implicitly

The reason should be pretty obvious: in Python 3 you can’t mix bytes and str objects. And when we type-check this program using mypy we indeed get a type error:

$ mypy main.py 
main.py:5: error: Unsupported operand types for + (likely involving Union)

Debugging the bug

So let’s try a program without a bug:

from demo import parenthesize

a = parenthesize('hello')
b = parenthesize('hola')
c = a + b  ### bug here<-- bug="" no="" span="">
print(c)

Run it and it works great:

$ python3 main.py
(hello)(hola)

So the type checker should be happy too, right?

$ mypy main.py
main.py:5: error: Unsupported operand types for + (likely involving Union)

Whoops! The same error. What happened? Of course, I set you up, so I can explain something about type checking.

The trouble with tribbles unions

The type checker takes the signature at face value, so that when checking the call, it infers the type Union[str, bytes]  for every call to parenthesize() , regardless of what the arguments are. This is because, for most functions of even modest complexity, a type checker doesn’t understand enough about what’s going on in the function body, so it just has to believe the types in the signature (even though in this particular case it would probably be easy enough to do better).

In our test program the types of a  and b  are both inferred to be exactly what parenthesize()  claims to return, i.e., both variables have the type Union[str, bytes] . The type checker then analyzes the expression a + b , and for this it discovers a problem: if a is either str or bytes, and so is b , then the +  operator may be invoked on any of these combinations of types: str + str , str + bytes , bytes + str , or bytes + bytes . But only the first and the last are valid! In Python 3, str + bytes  or bytes + str  are invalid operations.

Aside: Even in Python 2, those two are suspect: since while 'x' + u'y'  indeed works (returning u'xy' ), other combinations will raise UnicodeDecodeError, e.g.:

>>>'Franç' + u'ois'
Traceback (most recent call last):
  File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
ordinal not in range(128)

Anyway, the type checker doesn’t like this business, and it rejects operations on Unions where some combinations are invalid. What can we do instead?

Function overloading

One option would be function overloading. PEP 484 defines a magical decorator, @overload , which lets us get around this problem. We could write something like this:

from typing import overload

@overload
def parenthesize(s: str) -> str: ...
@overload
def parenthesize(s: bytes) -> bytes: ...

This tells the type checker that if the argument is a str , the return value is also a str , and similarly for bytes . Unfortunately @overload  is only allowed in stub files, which are a kind of interface definition files that show a type checker the signatures of a module’s contents without giving the implementation.

Type variables

Fortunately there’s an even better way, using type variables. This is how it goes:

from typing import TypeVar

S = TypeVar('S')

def parenthesize(s: S) -> S:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError("That's not a string, dude! It's a %s" % type(s))

Well… Almost. Our main.py program (unchanged from above) now gets a clean bill of health, but when we type-check this version we get errors on both return  lines:

demo.py: note: In function "parenthesize":
demo.py:7: error: Incompatible return value type: expected S`-1, got builtins.str
demo.py:9: error: Incompatible return value type: expected S`-1, got builtins.bytes

This is a bit hard to fathom, but the fix is what I was leading up to anyway, so I’ll reveal it now:

from typing import TypeVar

S = TypeVar('S', str, bytes)

def parenthesize(s: S) -> S:
    if isinstance(s, str):
        return '(' + s + ')'
    elif isinstance(s, bytes):
        return b'(' + s + b')'
    else:
        raise TypeError("That's not a string, dude! It's a %s" % type(s))

The only changed line is this one:

S = TypeVar('S', str, bytes)

This notation is called a type variable with value restriction . Yes, it’s mouthful; we sometimes also call it a constrained type variable. S is a type variable restricted to a set of types. It also has the advantage of telling the type checker that types other than str  or bytes  are not acceptable. Without that, a call like this would have been considered valid:

x = parenthesize(42)

because the original type variable (without the restrictions) doesn't tell mypy that this is a bad idea.

In fact, this particular use case (a type variable constrained to str or bytes) is so commonly needed that it's predefined in the typing module, and all we have to do is import it:

from typing import AnyStr

def parenthesize(s: AnyStr) -> AnyStr:
    # Etc. -- trust me, it works!

Real-world use of AnyStr

In fact, this is how many polymorphic functions in the os  and os.path  modules are defined. For example, in the stub for os.py  we find definitions like the following:

def link(src: AnyStr, link_name: AnyStr) -> None: ...

and also this:

def split(path: AnyStr) -> Tuple[AnyStr, AnyStr]: ...

These show us a bit more of the power of type variables: the signature for link()  indicates that either both arguments must be str  or both must be bytes ; split()  demonstrates that the type variable may also occur in more complex constructs: splitting a str returns a tuple of two str objects, while splitting bytes returns a tuple of two bytes  objects.

That’s all I wanted to share about AnyStr . Thanks for comments on the draft to Stephen Turnbull, Koos Zevenhoven, Ethan Furman, and Brett Cannon.