A Safe Type-Indexed Set for Standard ML

April 08, 2017

Update

The solution I present in this post turns out not to be safe. See the subsequent post for an explanation. What follows is the original article.

I recently discovered, while preparing the slides for my Modularity à la ML talk, a solution to a problem I've come across a while ago while implementing a type-indexed set data structure in Standard ML.

Functor-Based Data Structures

First of all, let's see how sets are represented in the SML/NJ library, which is what I've used initially. The SML/NJ library provides a collection of common data structures and algorithms that aren't part of the standard Basis library mandated by Standard ML. For set data structures, it exposes an ORD_SET signature that's meant for sets implemented using a predefined ordering on the contained elements (there is support for hash sets, too). Two functors produce structures that satisfy the ORD_SET signature: BinarySetFn and SplaySetFn. Here's a sample of using BinarySetFn with integer elements:

structure Main =
  struct
    (*
     * Create an IntSet module by instantiating the BinarySetFn
     * functor with a structure that provides ordering on integers.
     *)
    structure IntSet = BinarySetFn(struct
      type ord_key = int
      val compare = Int.compare
    end)

    fun main () =
      let
        val set1 = IntSet.fromList [1, 2, 3]
        val set2 = IntSet.fromList [2, 3, 4]
        val set3 = IntSet.intersection (set1, set2)
      in
        IntSet.toList set3 (* [2, 3] *)
      end
  end

The problem with this approach is that it's functor-based, which means that one has to call the BinarySetFn functor for each and every type of element that they want to put in a set data structure. You want a set of strings? Then you call the BinarySetFn functor with a structure implementing ordering on strings. You want a set of sets? Same thing: call the functor with an ordering on sets.

This is annoying because it's just too verbose and it's due to the fact that structures are not first-class values. They're part of the module language of Standard ML, which lives separately from the core language of values. Is there a better way? Maybe.

Type-Indexed Data Structures

At some point I decided I wanted to play around with a different design for such data structures that was not functor-based. The idea was to pass the ordering function as a parameter to the builder functions of the structures. Here's what I mean by that. The current ORD_SET signature looks similar to this:

signature ORD_KEY :
  sig
    type ord_key
    val compare : ord_key * ord_key -> order
  end

signature ORD_SET :
  sig
    structure Key : ORD_KEY

    type set
    type item = Key.ord_key

    val empty : set
    val singleton : item -> set
  end

So the ordering function is part of a sub-structure of the signature: ORD_KEY. What if instead we'd do this and eliminate the ORD_KEY signature entirely?

signature ORD_SET =
  sig
    type 'a set

    val empty : ('a * 'a -> order) -> 'a set
    val singleton : ('a * 'a -> order) -> 'a -> 'a set
  end

Now, the two "factory" functions will take a polymorphic ordering as an argument. The type parameter 'a will be the type of set elements. Presumably this ordering function will be stored inside a data structure that implements that abstract set type. Let's see how this could look like with a naive list-based set.

structure ListSet :> ORD_SET =
  struct
    type 'a set = {
      compare : 'a * 'a -> order,
      elems : 'a list
    }

    fun empty compare =
      {
        compare = compare,
        elems = []
      }

    fun singleton compare elem =
      {
        compare = compare,
        elems = [elem]
      }
  end

The set is represented using a record that stores the elements as a list, together with the ordering function for those elements. We can now add a couple other functions for inserting elements into an existing set and for obtaining the list representation back, which is now held hidden.

signature ORD_SET =
  sig
    (* continued *)
    val toList : 'a set -> 'a list
    val insert : 'a set -> 'a -> 'a set
  end

structure ListSet :> ORD_SET =
  struct
    (* continued *)
    fun toList { compare, elems } =
      elems

    fun insert { compare, elems } elem =
      let
        fun loop elems =
          case elems of
            [] => [elem]
          | head :: tail =>
              case compare (head, elem) of
                LESS => head :: (loop tail)
              | EQUAL => elems
              | GREATER => elem :: elems
      in
        {
          compare = compare,
          elems = loop elems
        }
      end
  end

I won't go in detail over the implementation of insert, but, in short, what it does is that it keeps the elements list sorted so that it's easier to find whether an element already exists in the list. It's still naive for a real set implementation that will most likely use a balanced tree, but it'll do. Let's now see how to use it:

- val intSet = ListSet.empty Int.compare;
- val strSet = ListSet.empty String.compare;
- val set1 = ListSet.insert intSet 1;
- val set2 = ListSet.insert set1 2;
- ListSet.toList set2; (* [1, 2] *)
- ListSet.insert strSet 2;
stdIn:16.1-16.24 Error: operator and operand don't agree [overload conflict]
  operator domain: string
  operand:         [int ty]
  in expression:
    (ListSet.insert strSet) 2
- ListSet.insert intSet "2";
stdIn:1.2-2.10 Error: operator and operand don't agree [tycon mismatch]
  operator domain: int
  operand:         string
  in expression:
    (ListSet.insert intSet) "2"
-

The Problem

As you can see, we didn't have to instantiate any functors here. The set is polymorphic just like a normal list and it's typesafe — one cannot put a string inside an int set or vice-versa. This is nice, but we're just about to hit the actual problem. Let's think how we would go about implementing a common set operation: union.

signature ORD_SET =
  sig
    (* continued *)
    val union : 'a set -> 'a set -> 'a set
  end

structure ListSet :> ORD_SET =
  struct
    (* continued *)
    fun union set1 set2 =
      let
        val { compare = compare1, elems = elems1 } = set1
        val { compare = compare2, elems = elems2 } = set2
      in
        raise Fail "not implemented"
      end
  end

I'm not going to fill in the actual implementation because what we have is enough to illustrate the problem. The issue we're facing is that we have two sets, each providing its own compare function and there's absolutely no guarantee that these two functions behave the same. For example, the following code typechecks, but it should't, because we're trying to merge sets with different orderings:

- fun reverseOrd ord (a, b) =
=  case ord (a, b) of
=    LESS => GREATER
=  | EQUAL => EQUAL
=  | GREATER => LESS;
-
- val set1 = ListSet.empty Int.compare;
- val set2 = ListSet.empty (reverseOrd Int.compare);
- val set3 = ListSet.union set1 set2; (* shouldn't compile, but it does *)

The same problem can manifest when the ordering function isn't stored as part of the set representation, but it's instead passed as an extra argument to each of the set operations, either explicitly or implicitly, i.e., using type-classes. It's actually even more hideous in that case because it creates problems even for the insert operation (imagine two insert calls on the same set instance, but supplying different orderings to each call). This is a problem that can appear in Scala, for example, whose type-class emulation does not guarantee coherence, i.e., a given type can only have a single associated instance of a particular type-class.

So, is there a solution to this problem in SML? Initially I thought there isn't, but a few days ago I realized there might be one, albeit not that simple.

The Solution

To recap, the problem appears for this call, where we can't statically guarantee that the two set instances carry the same ordering:

ListSet.union set1 set2

One idea of ensuring this property is to somehow tag the orderings through the type system; associate each ordering function with a unique type. We can achive this by wrapping an ordering function in a structure that opaquely exposes the type of the function. And two opaque types are equal only if they're part of the same signature. So, the first step towards opaque types is a signature, which will let us hide the type in the structure(s):

signature TAGGED_ORD =
  sig
    type t
    val compare : t * t -> order
  end

This looks exactly as the ORD_KEY signature defined by the SML/NJ library, but we're not done yet. Let's try to use it in a way that wasn't meant to be:

structure IntOrd :> TAGGED_ORD =
  struct
    type t = int
    val compare = Int.compare
  end

Notice the :> part where we're specifying that the structure must comply with the signature. The difference between : and :> is that :> will keep the representation of types hidden. In our particular case, it renders the IntOrd structure useless. We cannot use it at all:

- IntOrd.compare (1, 2);
stdIn:35.1-35.22 Error: operator and operand don't agree [overload conflict]
  operator domain: IntOrd.t * IntOrd.t
  operand:         [int ty] * [int ty]
  in expression:
    IntOrd.compare (1,2)

This says that IntOrd.compare only accepts types IntOrd.t, not int. We know they're the same, but the compiler does not because of the opaque signature ascription :>. What's to do? We need to add a new function to the signature that will let us lift int values to IntOrd.t values:

signature TAGGED_ORD =
  sig
    type t
    val wrap : int -> t
    val compare : t * t -> order
  end

structure IntOrd :> TAGGED_ORD =
  struct
    type t = int

    fun wrap a = a
    val compare = Int.compare
  end

We can now order integers again:

- IntOrd.compare (IntOrd.wrap 1, IntOrd.wrap 2);
val it = LESS : order

What the TAGGED_ORD.t represents now is basically the singleton type of the compare function defined in the structure. There will be no other type that will be equal to it, except itself.

We're still missing a thing, however. The type-indexed ListSet module will now need not only a compare function, but also a wrap function. And because modules aren't first-class values in Standard ML, we cannot pass our TAGGED_ORD structures to the ListSet.empty function. We need a way to pack those two functions together in a record and pass that instead. So, our final TAGGED_ORD signature looks like this:

signature TAGGED_ORD =
  sig
    type t
    type a
    val wrap : int -> t
    val compare : t * t -> order
    val asRecord : { wrap : a -> t, compare : t * t -> order }
  end

Notice that we had to introduce another type, a, that will represent the type of values we're lifting from. Now IntOrd becomes this (notice that the a type is exposed, but t is not, through what it's called a translucent signature ascription):

structure IntOrd :> TAGGED_ORD where type a = int =
  struct
    type t = int
    type a = int

    val wrap = Fn.id
    val compare = Int.compare
    val asRecord = { wrap = wrap, compare = compare }
  end

We're now finally ready to change our ORD_SET signature and ListSet structure to work with this new type of "tagged" orderings:

signature ORD_SET =
  sig
    type ('a, 'ord_t) set

    type ('a, 'ord_t) compare = {
      wrap : 'a -> 'ord_t,
      compare : 'ord_t * 'ord_t -> order
    }

    val empty : ('a, 'ord_t) compare -> ('a, 'ord_t) set
    val singleton : ('a, 'ord_t) compare -> 'a -> ('a, 'ord_t) set
    val toList : ('a, 'ord_t) set -> 'a list
    val insert : ('a, 'ord_t) set -> 'a -> ('a, 'ord_t) set
    val union : ('a, 'ord_t) set -> ('a, 'ord_t) set -> ('a, 'ord_t) set
  end

The 'ord_t type parameter in type ('a, 'ord_t) set represents the unique type associated with a wrap/compare record. So the set type isn't indexed only on the type of the containing elements, but also on the type of the ordering function. This new type parameter will ensure the type safety we're after.

Here's how the ListSet structure needs to change now:

structure ListSet :> ORD_SET =
  struct
    type ('a, 'ord_t) set = {
      wrap : 'a -> 'ord_t,
      compare : 'ord_t * 'ord_t -> order,
      elems : 'a list
    }

    type ('a, 'ord_t) compare = {
      wrap : 'a -> 'ord_t,
      compare : 'ord_t * 'ord_t -> order
    }

    fun empty { wrap, compare } =
      {
        wrap = wrap,
        compare = compare,
        elems = []
      }

    fun singleton { wrap, compare } elem =
      {
        wrap = wrap,
        compare = compare,
        elems = [elem]
      }

    fun toList { wrap, compare, elems } =
      elems

    fun insert { wrap, compare, elems } elem =
      let
        fun loop elems =
          case elems of
            [] => [elem]
          | head :: tail =>
              case compare (wrap head, wrap elem) of
                LESS => head :: (loop tail)
              | EQUAL => elems
              | GREATER => elem :: elems
      in
        {
          wrap = wrap,
          compare = compare,
          elems = loop elems
        }
      end

    fun union set1 set2 =
      let
        val { wrap = wrap1, compare = compare1, elems = elems1 } = set1
        val { wrap = wrap2, compare = compare2, elems = elems2 } = set2
      in
        raise Fail "not implemented"
      end
  end

The most important difference is in the insert function, where we need to call wrap on the elements that we want to pass to compare:

case compare (wrap head, wrap elem) of

The same thing would have to happen in the implementation of union as well, but we don't have one yet. All we're interested in is whether illegal calls are refused by the type checker. To check that, let's define yet another ordering on integers, a reverse ordering:

structure RevIntOrd :> TAGGED_ORD where type a = int =
  struct
    type t = int
    type a = int

    fun reverseOrd ord (a, b) =
      case ord (a, b) of
        LESS => GREATER
      | EQUAL => EQUAL
      | GREATER => LESS

    val wrap = Fn.id
    val compare = reverseOrd Int.compare
    val asRecord = { wrap = wrap, compare = compare }
  end

Finally, we can verify whether we can union two sets based on different orderings:

- val set1 = ListSet.empty IntOrd.asRecord;
val set1 = - : (a,IntOrd.t) ListSet.set
- val set2 = ListSet.empty RevIntOrd.asRecord;
val set2 = - : (a,RevIntOrd.t) ListSet.set
- ListSet.union set1 set2;
stdIn:110.1-110.24 Error: operator and operand don't agree [tycon mismatch]
  operator domain: (a,IntOrd.t) ListSet.set
  operand:         (a,RevIntOrd.t) ListSet.set
  in expression:
    (ListSet.union set1) set2
-

Success! Another illegal state made unrepresentable! Also, notice how the error message lets us know that we've been trying to use difference orderings (IntOrd.t vs RevIntOrd.t):

  operator domain: (a,IntOrd.t) ListSet.set
  operand:         (a,RevIntOrd.t) ListSet.set

That's all, folks! You can find the final version of the code in this gist.