Recently, I posted the following poll on Twitter:
Say iterating over a hash table gives non-deterministic results. So you iterate over it, store the key-value pairs in an array, and do a stable sort. Is the sorted result deterministic? (assume no bugs in the implementations of sorting, iteration and that ≤ is a total order)
A: Yes B: Not necessarily
I later clarified that the hash table gave you unique keys. And that the ≤ operation was applied to (key, value) pairs, with the natural interpretation that it would first compare the keys and then the values.
I'm also implicitly assuming that the equality check implied by ≤ is the same as the equality check used by the hash table, as a sensible default.
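As a concrete sketch of that comparison (in Python, where tuples already compare lexicographically, so they model the intended (key, value) ordering):

```python
# Python tuples compare lexicographically: the keys are compared first,
# and the values are consulted only when the keys are equal.
assert ("a", 99) <= ("b", 0)   # key "a" < "b"; the values never matter here
assert ("a", 1) <= ("a", 2)    # equal keys, so the values break the tie
```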
The correct answer to the question is Yes, you do recover determinism.
The interesting part of this question, in my opinion, is actually trying to do a proof of the statement. Depending on how much information you carry along, the answer can be Yes, or it can be (as a mistake) Not necessarily. This matters when trying to formalize proofs and doing library design in languages with fine-grained types. For example, in a language with algebraic effects and refinement types, you may need to propagate the right evidence to the point of sorting for the "determinism is recovered" proof to discharge a NonDeterministic effect.
Aside: Why do I consider this problem interesting? One context it applies to is compilers. Compilers use hash tables for all sorts of things, because you need to do lookups all the time. However, at the end, you typically want to emit the information in some standard order, because you want build results to be deterministic. I'm guessing this is also why Python 3.7+ guarantees deterministic (insertion-order) iteration for dicts; that way, more software is deterministic without each software author needing to think about it.
Now, let's step back and look at what it would take to do a proof that determinism is recovered.
First, observe that the keys being unique means that comparison will never end up comparing the values. So you don't need comparable values. Second, the keys being unique means that the stable/unstable sorting distinction is meaningless. So the problem reduces to: does sorting a list that is in a random order but has no duplicates give a deterministic result? The answer is yes. The position of each element in the sorted list depends only on the number of elements that were smaller than it in the original list, which is independent of how the elements were arranged originally.
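This direct argument can be sanity-checked with a small sketch in Python, simulating the non-deterministic iteration order by shuffling under different seeds (the specific pairs are made up for illustration):

```python
import random

# Unique keys, as in the clarified problem statement.
pairs = [("b", 2), ("a", 1), ("c", 3)]
reference = sorted(pairs)

# Whatever order iteration hands us the pairs in, sorting
# collapses all of those orders to the same result.
for seed in range(100):
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    assert sorted(shuffled) == reference
```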
However, if you try to break this proof up into two pieces:
- Iterating over the hash table and collecting the results into an array yields an array in some arbitrary order.
- Sorting an array that is in an arbitrary order recovers determinism.
The proof no longer quite works. Why? A proof of point 2 relies on the extensionality of == (this is the == implied by ≤). Specifically, if there exists some function f such that two elements in the list satisfy a == b but f(a) != f(b) (where != should be interpreted in the codomain of f), then even after sorting the array, someone may be able to observe artifacts of the random order by mapping f over the array.
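A concrete instance of this failure, using IEEE floating point in Python: 0.0 == -0.0, but str distinguishes them, so str plays the role of f:

```python
# Two "iteration orders" of the same ==-equal elements.
a = [0.0, -0.0]
b = [-0.0, 0.0]

# As far as == can tell, sorting erased the difference...
assert sorted(a) == sorted(b)

# ...but a stable sort never reorders ==-equal elements, so mapping
# f = str over the results exposes the original orders.
assert [str(x) for x in sorted(a)] != [str(x) for x in sorted(b)]
```

Here == is not extensional (str observes a distinction that == cannot), and that is exactly the gap the modular proof of point 2 falls into.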
So to summarize: you can either do a "direct" proof if you propagate the uniqueness of keys. Or you can do a more modular proof (it has a subpart that could be reused elsewhere!) if you have access to a proof of extensionality instead of unique keys.
I find this interesting because in most languages, the documentation doesn't mention extensionality as a requirement of ==. Also, it would be a self-own if they did, because they allow comparing IEEE floating point values with == (Rust being a notable exception). I suspect another part of the reason is that people aren't used to thinking about relatively abstract properties like this (e.g. the Liskov substitution principle gets a lot of airtime in tech talks, but most code I've seen in practice doesn't follow it strictly). However, it is likely that programmers have used this kind of reasoning (without explicitly thinking about extensionality) to justify sorting as a means of recovering determinism. It works out in practice because many common types have an extensional ==. So it's one of those cases where fast and loose reasoning is accidentally correct.
Keeping mainstream languages aside, I think this is interesting for languages with very rich type systems, such as refinement types and dependent types. In such a language, breaking proofs apart is analogous to writing a program in a modular fashion. As this small example shows, making proofs modular requires careful thinking, and is arguably harder than writing a modular program, because you need to anticipate what proofs someone may want to do with your values. Of course, I'm not claiming any novelty with this observation, I think this is quite well-known in the dependent types community. However, I think this point is worth re-iterating and spreading widely -- programmers often want increasingly sophisticated type systems (I plead guilty myself), but they don't necessarily think through all the ramifications. Library design without type-level proofs is already quite hard; I think library design with proofs has quite a ways to go before becoming fully mainstream.