Keep Working, Worker Bee!


Are disjoint unions a lurking horror of typed programming languages?

I've been reading about the XDuce project recently for our PL reading group here at Chicago. XDuce is a domain-specific language for writing XML transformations in a totally type-safe way: not only do XML values need to match their schemas, but your program needs to respect that schema too. For instance, if you're writing a program to process XML documents that look like

<A><B>[some string]</B><B>[some other string]</B> ...</A>

then the type system won't let you try to select a tag named <C>...<C> out of it.

To make that happen, XDuce introduces a really neat idea called regular-expression types. Regular expression types are just what they sound like: in addition to ground types like string, you also get sequencing, Kleene-star, alternation, and so on. (Also, because it's XML, you get "tag types": A[ty] is a tag named A whose contents have the given type ty.) So a possible type for the XML documents above might be A[B[string]*]. Pretty cool. But the really cool part is that they define the subtyping relation to be the same as the subset relation: each type describes a set of trees, and if you have two types t1 and t2, and t1 describes a strict subset of t2 then t1 <: t2. For instance,

A[B[string],B[string],B[string]] <: A[B[string]*]

A[B[string]*] <: A[any]

This is given in the XDuce literature purely as a way of making a type system that would work with XML data, and I think it's obviously useful for that purpose. During our discussion, though, the reading group started speculating as to whether it would be useful for general programming too. The MLians in the group (i.e. everyone but me) argued that it wouldn't be, since ML's tagged union types give you the same power in a more organized way.

That got me thinking: at the most fundamental level, how is this system different from ML's normal system? I think the answer is, ML forces all unions to be completely disjoint. You can say
datatype myunion = A of int | B of int
but any value of type myunion can only ever be one of the two possibilities, not both simultaneously. At its core, I think XDuce's regular expression types don't force that on you: you might well have a single value that is an instance of both alternatives in a union type, and that's where both the power and the complication come in.

So here's what I'm wondering now. When programming in ML, I've never actively noticed myself contorting programs so that I can make a particular union type disjoint when it wouldn't naturally be that way, but is that just because I've never thought of that as an option? Are lots of ML programs more complicated than they ought to be because of this restriction?

I think every language has its set of "lurking horrors," limitations or design decisions that force you to write particular programs in unnatural ways. I call them "lurking" because you don't necessarily realize you're being limited while you're coding; usually you've got to have some sort of epiphany, like learning a new language that doesn't have the same restriction, to realize that you've been tied up the whole time. Lack of higher-order functions is a big example of a lurking horror; type systems without polymorphism or option types are another. Are disjoint unions a third?


  • I don't have a strong argument either way about your question, but some thoughts follow.

    If I have a function that accepts 32-bit integers and you try to give it a 16-bit integer, is this an error, or should some automatic conversion be applied?

    Each answer to this question yields a different perspective on type systems. ML's answer is a most emphatic "Yes, it's an error." In the ML philsophy, these kinds of errors are so important that the language is built around generative types to help you catch even more of them. To put some names with your example, consider

    datatype measure
    = Inches of int
    | Centimeters of int

    The whole idea behind these union types is that the tag conveys important information beyond what raw data is actually stored. Ok, great. But, to borrow from the XDuce examples, what about

    datatype person
    = A of { name : string,
    email : string,
    tel : string * string * string }
    | B of { name : string,
    email : string,
    tel : string list }

    The idea here is for A to mimic a limited sequence of three phone numbers by use of a 3-tuple of strings, where B mimics the Kleene closure via a list.

    Now, my question is: how important is the difference between A and B for your code? Maybe it's an important difference because your database has three fields for phone numbers, and people with more than three numbers will need to be treated differently. In this case, though, even though a person of kind A can be viewed as a person of kind B, the whole point of the union is to keep the two distinct. But maybe you want the best of both worlds -- you want to keep the distinction, but treat the notion of a "person" uniformly. Well, then you wrap it up in a person module that lets you deal with phone numbers in a convenient way (probably as a list) and then does whatever is appropriate internally.

    On the other hand, if you really don't care about the difference between A and B, I think the argument can be made that separating them via a union is the wrong thing to do. And even if ML doesn't let you treat a 3-tuple as a subtype of a list (for example), this just seems like a case of trying to use a 16-bit integer for a 32-bit integer parameter: in ML's mindset, that needs an explicit conversion, because ML is all about catching those "errors" and the assumption that they are indeed errors.

    So, my feeling is that either you want to treat two superficially different things separately, or uniformly. For the former, (disjoint) union types are exactly what you want. For the latter, a single, convenient, canonical representation seems the way to go, with appropriate conversions taking place beforehand.

    By Blogger adrassi, at 17:04  

Post a Comment

<< Home