You shouldnʼt just throw up your hands once you have cyclic references! Itʼs possible and worthwhile to design tools to work with the raw graph of runtime data, no matter its shape. With the proper metaprogramming hooks, you could save and restore a whole runtime environment *from the inside*, which is pretty amazing to think about.

(Well, okay, you might give up at things like functions, if there isnʼt enough metaprogramming, or running sockets, for example – thatʼs okay!)

The main thing we will work up to (and go beyond) is Pythonʼs `pickle` module, including my own implementation of it for the Nasal scripting language (which I believe is cleaner in some respects, although obviously less industrial).

This post has spent a long time marinating in my mind, so Iʼm happy to finally get it on the page!

Iʼve identified some basic aspects of data:

- Pure (immutable) data
  - Either acyclic …
  - … or allowed to be cyclic
- Mutable data (say, in-memory)
  - Youʼre basically forced to allow cyclic references!^{1}
  - Sharing matters!
- Data defined in reference to external things (like file or socket handles)
  - External is relative – as it should be!
  - You need to think about capturing intent
  - e.g. file paths are probably your best indicator of intent – but if it was a temporary file, you could probably create a new temporary file instead! but oh now what if a different string is storing the path of the file, now you donʼt know how those pieces of data relate anymore and cannot update that string …
- Functions?
  - *Are* functions data?? (Iʼm including closures and procedures and routines and all that stuff in my definition of functions, btw.)
- Quotiented data (Agda, Lean, Rocq)

It will be important to think about what notions of equality are useful and adequate for these kinds of data!

We will talk about runtimes, including what garbage collectors do. Garbage collection is pretty special, since it gets into all the nitty gritty implementation details of a runtime: it essentially pins down the data model of most languages that have garbage collection.

It is also important to talk about what capabilities the language provides for working with and thus distinguishing data. If the difference cannot be observed, it does not exist! (And we may be “morally” justified in ignoring it anyways.)

Serializing in its basic form – for pure data – means writing it out to a string such that it can be recovered from that string. (Bitstring, bytestring, Unicode string, even a bignum – the exact details do not matter. It also does not matter if it is a well packed encoding or a sparse encoding with lots of invalid strings that do not deserialize.)

Pickling is more general: it means serializing non-pure data by taking a snapshot of mutable data, preserving referential identity in the process (this is hard!), and doing your best with external references, and maybe giving up on functions.

Part of my argument is that serializing and pickling is very tied up in what data means!

Like, what is the essence of data? I hope it is something you can grasp fully, explain fully, and write out to a file and fully reconstruct in a new runtime context. (Mumble mumble: decidable equality, computation, countability.)

I think I have definitions of equal and equivalent that are useful.

Equality in the classical sense^{2} is going to be too strict, I argue, for the notion of data I want to consider. First of all, it is hard to compare values across runtimes, which is silly! Data does not only exist for an instant in time! And two mutable objects can be *interchangeable* even if they are not *identical* references.

This discrepancy appears as soon as you have mutable references, which you can tell apart if you have both to compare, but which would otherwise act the same.

So I will use “equal” to mean “values that live within the same run of the runtime and will definitely act the same, thus equal in all ways the language could distinguish^{3}”.

And I will use “equivalent” to mean “values that, if they exist in the same run of the runtime, would cause no difference if all references to them were swapped and then their children were shallowly swapped; otherwise, values that would act as similar as possible across different runs of the runtime”.

We definitely need to walk through an example of what “equivalent” means to unpack my definition. Hereʼs one in JavaScript, although the particulars of JavaScript do not matter:

```
// minimal assert, so this snippet runs standalone
const assert = (cond) => { if (!cond) throw new Error("assertion failed"); };

// Set up some data
let shared = [];
let v0 = { 0: shared, 1: [], 2: 2 };
v0[3] = v0;
let v1 = { 0: shared, 1: [], 2: 2 };
v1[3] = v1;
shared.push(v0, v1);
// I hope you agree that this characterizes
// the state of the data, more or less
for (let v of [v0, v1]) {
  // shared reference
  assert(v[0] === shared);
  // equivalent empty arrays,
  // though not equal
  assert(v[1].length === 0);
  // index 2 is 2
  assert(v[2] === 2);
  // self reference
  assert(v[3] === v);
}
// and they are referenced by shared in order
assert(shared[0] === v0);
assert(shared[1] === v1);
// they are not equal:
assert(v0 !== v1);
// but we can swap them, ...
[v0, v1] = [v1, v0];
// ... all references to them, ...
[v0[3], v1[3]] = [v1[3], v0[3]]; // yes, self references count
[shared[0], shared[1]] = [shared[1], shared[0]];
// ... and the immediate references they hold,
// since they have the same shallow structure
[v0[0], v1[0]] = [v1[0], v0[0]];
[v0[1], v1[1]] = [v1[1], v0[1]];
[v0[2], v1[2]] = [v1[2], v0[2]];
[v0[3], v1[3]] = [v1[3], v0[3]];
// this last one fixes up their self references again!
// Now we have the same observable state of the world!
for (let v of [v0, v1]) {
  // shared reference
  assert(v[0] === shared);
  // equivalent empty arrays,
  // though not equal
  assert(v[1].length === 0);
  // index 2 is 2
  assert(v[2] === 2);
  // self reference
  assert(v[3] === v);
}
// and they are still referenced by shared, in the
// order corresponding to the same variable names
assert(shared[0] === v0);
assert(shared[1] === v1);
// so we conclude that `v0` and `v1` are equivalent!
```

Note that “equal” and “equivalent” are, strictly speaking, external notions. However, “equal” is often testable internally, such as by `===`. (Except for the case of `NaN` – you technically have to use `a === b || (a !== a && b !== b)` to detect `NaN === NaN`, and I believe that different `NaN`s are indistinguishable.)
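To make the `NaN` wrinkle concrete, here is a tiny comparison of `===`, the workaround, and the built-in `Object.is` (which has this “SameValue” behavior, though it additionally distinguishes signed zeros):

```javascript
// === says NaN is not equal to itself:
const eqStrict = (a, b) => a === b;
// The workaround exploits the fact that NaN is the
// only value for which x !== x:
const eqSameValue = (a, b) => a === b || (a !== a && b !== b);

console.log(eqStrict(NaN, NaN));    // false
console.log(eqSameValue(NaN, NaN)); // true
console.log(Object.is(NaN, NaN));   // true
// caveat: Object.is also distinguishes signed zeros,
console.log(Object.is(0, -0));      // false, even though 0 === -0 is true
```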

“Equivalent” is much harder to test: for functions it will be impossible, but even for data made out of simple JSON-like structures it takes a fair amount of bookkeeping and some trickery to truly decide it.

In Dhall, equal and equivalent happen to be the same!^{4} Unfortunately this is more a reflection of the rather stagnant notion of data in Dhall – no mutability, no references, no cyclic structures.

As soon as you have mutable references, equality and equivalence will not be the same.

(This is because you can always^{5} use test mutations to tell when two references are identical or not, so you cannot just outlaw referential equality and hope that they become indiscernible.)
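Here is a sketch of that test-mutation trick in JavaScript (the helper name is mine; it assumes we may temporarily attach a property, so it would fail on frozen objects):

```javascript
// Hypothetical helper: detect whether two mutable objects are the
// same reference without ===, by mutating one and observing the other.
function sameReference(a, b) {
  const probe = Symbol("probe"); // a symbol key avoids clashing with real keys
  a[probe] = true;
  const same = b[probe] === true;
  delete a[probe];
  return same;
}

const x = {};
const y = {};
console.log(sameReference(x, x)); // true
console.log(sameReference(x, y)); // false
```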

And practically speaking, there are always wrinkles.

Maybe there are bignums that can be observed to have different encodings by esoteric functions, but that both encode the same number – you would be justified in wanting to consider them equal. (Especially if this is a distinction that can be observed by those esoteric functions but not reflected by a pickling function, for example. That is, if you could know that one number has an overlong encoding by virtue of being observably *different* from the other number, but not know which number is overlong, or, indeed, how overlong it is. Well, I suppose pickling would mean you have a canonical, short form, so it would just be a question of not knowing how overlong it is – maybe even being able to reconstruct an overlong form without trying a bunch of weird math operations to force decanonicalization, which would potentially get undecidable.)

Letʼs talk about concepts of data that exist across specific languages and runtimes!

As a warm-up to thinking about references and pointers: Letʼs observe that immutable strings are treated specially by a lot of runtimes!

V8 has a lot of different representations of strings, and it is free to change representation between them, e.g. while garbage collecting. It is free to make these changes since they arenʼt observable from JavaScript (outside of performance characteristics).

Erlang likewise has immutable binaries (bytestrings) which serve similar purposes although they only support fast slicing (no fast concatenation). To get fast concatenation, users are expected to build up nested lists and flatten them out into one binary at the end of all the concatenation operations. (Luckily it is rare to want to index back into a binary as you are building it up, so this is acceptable for most use-cases.)
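The iolist idea can be sketched in JavaScript (an illustration of the technique, not Erlangʼs actual representation): “concatenation” just nests, and the single flatten happens at the end:

```javascript
// “Concatenation” is O(1): it just nests.
const concat = (a, b) => [a, b];
// One flattening pass at the very end.
const flatten = (io) =>
  typeof io === "string" ? io : io.map((part) => flatten(part)).join("");

let naive = "";
let iolist = "";
for (const part of ["a", "b", "c"]) {
  naive += part;                 // copies the accumulated string each time
  iolist = concat(iolist, part); // builds [[["", "a"], "b"], "c"]
}
console.log(flatten(iolist));           // "abc"
console.log(flatten(iolist) === naive); // true
```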

The thing that the Erlang garbage collector is specifically allowed to do is to delete old binaries whose contents are only partially referenced. Then it has the task of fixing up the references to those slices to refer to the new binaries with adjusted indices. This is definitely mutating data at some level (e.g. the level of raw memory), but it is not visible from Erlang at all.

Anyways, this is just a nicer example of stuff the runtime can do behind the scenes that has a very precise semantic description. It happens “behind the scenes” because changes occur in memory that do not affect equality, as seen from the language. As we start to think about references and managed pointers and external references and functions, it gets more complicated!

It is also a good warning to be careful about what “immutable” means: since every runtime uses mutation at some level, we only care about immutability as viewed from the language itself.

Mutability should be one of the first things you think of when you think about data, especially if you have had some exposure to ideas from Functional Programming (FP).

The really nice thing about pure languages like Haskell and PureScript is that they separate out mutable references from pure data, and dispense with mutable variables altogether. It affords us a much nicer toolbox for expressing these concepts, in theory and in code.

One way to say it is that pure data is characterized by its serialization. If it makes no reasonable difference whether you keep using the original value versus the value after it has been serialized and deserialized, then that is pure data. (This is a reformulation of referential transparency.)
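A small illustration, using JSON as the stand-in serialization: any pure computation gives the same answer on the original and on the roundtripped copy:

```javascript
// JSON roundtrip as a stand-in for serialize + deserialize:
const roundtrip = (value) => JSON.parse(JSON.stringify(value));

const data = { xs: [1, 2, 3], label: "pure" };
const algorithm = (v) => v.xs.reduce((acc, n) => acc + n, 0);

console.log(algorithm(data));            // 6
console.log(algorithm(roundtrip(data))); // 6 – indistinguishable
```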

Mutable data cares about its referential **identity**, on the other hand. (Note that “reference” most often means something like “pointer”, but it doesnʼt have to be restricted to that.)

The thing about pure data is that it essentially doesnʼt really care if it is a tree or a DAG. I mean, it matters in terms of efficiency and storage space! (Though this is rarely exposed at a useful level.^{6}) But the results of an algorithm wonʼt be changed just by sharing more or less internal structure among the immutable data.

Mutable data pretty obviously cares about sharing.

Functions are tough. When viewed from above (mathematically) and below (most runtime implementations), they cannot be sensibly serialized.

One thing that is always true^{7} is that you **cannot** expect to know when two functions are equivalent. They may happen to be equivalent in an obvious way, but if they arenʼt, you cannot necessarily find out that they actually have distinct behavior. (In other words: equivalence is semidecidable [reference needed].)

If you have the original syntax defining the functions, you *do* have a snowballʼs chance in hell of deciding that two functions are equivalent, by doing some procedures to normalize the syntactic definitions of the functions. (One difficulty is that you will also have to keep track of closures, including captured variables and open variables.)

But once the syntax is forgotten (viz. at runtime), the best you can do is a pointer comparison, which is a very poor decider of equivalence.

The funny thing is that, although all functions start out as syntax, syntactic normalization is very rare for languages.

Only a few term-based languages like Dhall will always keep functions as syntax^{8}, and only theorem provers like Agda/Lean/Rocq will keep the function syntax around for their core operation (whereas code extraction and runtime is almost entirely a separate issue).

(Aside: In theory you could defunctionalize a whole closed-system program right? At least for pure data? Maybe for mutable too?)

However, in languages like Agda, each function definition is considered distinct, so it might as well be a pointer check. (This is true for almost all theorem prover languages, certainly for recursively-defined functions – Dhall just avoids the issue by not having recursive functions!)

To really talk about data, we have to talk about specific runtimes. Weʼll work our way up from simplest to more thorny.

Languages that nobody actually runs. Theyʼre just too academic or something. (Joking.)

JSON is a pretty great place to start.

JSON is pure, immutable data. It is acyclic and unshared. If you want to represent sharing, you need to represent it at a different level! By encoding some notion of identities and references, that is.

Many other things (like binary data/bitstrings) are mathematically equivalent to JSON in this regard. They are “just” pure data, in which you can encode any other pure data, with more or less help from existing structure like arrays and objects.

There are some wrinkles, like the fact that numeric precision is not specified, and the order of keys may matter to some runtimes and not others (okay, practically all runtimes mostly preserve order by now).

Note that JSON is often deserialized to *mutable* data. This means that *mutable* data may not roundtrip through JSON: shared values will be unshared. But Iʼm getting ahead of myself.

From an abstract perspective, Dhall is pretty much like JSON, with a couple key differences:

- The data is strictly typed.
- You can serialize functions!

Dhall comes with a good notion of equality – judgmental equality – which applies normalization to open terms (expressions with free variables – in particular, functions). By applying algebraic reasoning, judgmental equality can sometimes show that two functions are equal, but not always!

(I use this mostly as a catch-all for theorem provers.)

Agda is kind of like Dhall, but its emphasis is different, and it has more features. As mentioned somewhere here, recursive functions mean that it gives up even earlier on detecting equality of functions. The other important new feature is quotients.

Quotients are interesting: they are – at some level – represented by the same (runtime) data as everything else, but some of the underlying data has to be treated as equal by *everything* (in particular, by equality, and thus by the rest of the theory, since everything has to preserve equality).

Specifically, Higher Inductive Types and Observational Type Theory are interesting topics here, but way too deep to get into at the moment (and well covered by others – not something I feel much need to opine on in this context).

Okay maybe some people actually run these languages and interact with their runtimes.

Erlang is pretty interesting, for an untyped language. Data is fundamentally immutable^{9} and acyclic and this has influenced many aspects of its design.

This enables somewhat exotic features like non-linear pattern matching and deep equality, which are unheard of in untyped, mutable languages. (These are the same feature, actually. Non-linear pattern matching is implemented via deep equality.)

In fact, Erlang specifies a very specific serialization format for data, for seamlessly communicating across distributed systems.

It is not the case that Erlang processes are data. That is, processes themselves are not just a kind of data. Erlang does not claim to let you save the full state of a process off to disk and restart it from the saved data. That would be a little silly – what would happen to all the other processes it was bound to? No, processes are part of a dynamic runtime system.

However, a *reference* to a process *is* “just” data still, and can be compared for equality like any other data, and it can be serialized.

(The same for BEAM functions and closures, btw. Their references are opaque in some sense, but still just data.)

And interestingly, the dynamic, asynchronous nature of processes means that they *must* expose mutation.

Indeed, one way to implement a mutable reference is to spin up a process that holds the “current” value^{10} and returns it when queried and purely updates it in its state. That is, its state is immutable, but external callers can see fresh data whenever they query the process.

Haskell is actually one of the most complex ones here, since there are many levels you can talk about data at.

On the one hand, you can talk about pure data all day long. You can pretend that it operates like Agda or Dhall – totally pure! (Except you cannot compare functions for equality [except you can].)

You can even add mutable references (again with equality [again with the exception]). Itʼs actually really beneficial to have this separation between mutable references and immutable data that they contain, but I didnʼt get into it here.

Mutable references form a possibly cyclic directed graph, just like the imperative languages we will be talking about. But Haskell can also form cyclic data references via laziness: “tying the knot”, as it is called.

However, Haskell is a bit weird in that you can also peek inside the machine, and compare pointers and do other “naughty” stuff. This can be used to short-circuit deep equality comparisons via referential equality, memoize functions on uncomparable types, and other legitimate uses^{11}. Except it isnʼt so naughty (if you do it right), it is just a different layer of abstraction.

In fact, if you drill down, Haskell has some kind of data model that its runtime operates on. This will tell you when it is okay to use `unsafeCoerce`, for example. Itʼs maybe worth talking about the way Haskell evaluates, what its thunks represent, how mutable and immutable data work, STM, FFI – but it just goes on for ages.

I think itʼs really worth thinking about this deeply, taking seriously the kinds of data that Haskell uses at runtime, and how references to them interact, how the garbage collector makes sense of it all.

But as we will see, an awful lot of runtimes seem to be about managing the graph of references of data, and itʼs useful to be able to work with those graphs at some point, even if the language is hesitant to give it up so easily. (Allowing a program unmanaged access to its own heap would be a disaster. Itʼs understandable, really.)

JavaScript will be our stand-in for a scripting language with a simple, dynamically typed, mutable data model (and garbage collector).

Everyoneʼs familiar with the JSON side of JavaScript: it stands for JavaScript Object Notation, after all. But once you embed in the larger language, you can get things like cyclic references via mutation. Even things like arrays work differently than JSON promised.^{12}

As a silly example,

```
// this is the “weird” JS `const`
// where the variable reference is
// constant, but the data is mutable!
const selfRef = [];
selfRef.push(selfRef);
console.log(selfRef[0] === selfRef);
console.log(JSON.stringify(selfRef));
// Uncaught TypeError: cyclic object value
```

This is no longer JSON serializable!

This one is still JSON serializable, but it no longer behaves the same after deserialization:

```
const twin = {name: "Lo"};
const twins = [twin, twin];
const gemini = JSON.parse(JSON.stringify(twins));
twins[1].name = "Hi";
gemini[1].name = "Hi";
console.log(JSON.stringify(twins));  // [{"name":"Hi"},{"name":"Hi"}]
console.log(JSON.stringify(gemini)); // [{"name":"Lo"},{"name":"Hi"}]
```

So JavaScript, by allowing mutation, can observe cyclic and shared references that JSON simply does not have.

Most people donʼt reckon with this aspect of the runtime at a deep level! Obviously they throw mutable references around all the time and understand that they are shared, and will design some kind of serialization format that uses IDs or something and then reconstruct the right shared references on top of that. But they donʼt build debugging tools for JavaScript itself that would work with arbitrary data.

But what if you didnʼt have to?

Thereʼs actually a way to make a serialization so that `gemini` behaves like `twins`. I call this pickling, after the Python library. More on this later.

Not much worth saying about functions. You can get their source code in JavaScript (why??), but you cannot observe their closure, so you cannot pickle them up thoroughly.

As will be the theme, references to foreign data (not strings, numbers, objects, arrays) are tough.

Global object types are worth thinking about. They are opaque to code, but in theory they live in predictable places in the global namespace each time, so the proper reference to them can be reconstructed. However, thereʼs still complications, such as that different runtimes will expose different ones (Chrome, Firefox, Node, Deno, …).

By far the most common types of foreign objects will be from the DOM. Some can be serialized pretty directly – at least snapshotted if they arenʼt immutable (like all the little attribute list or node list types, or bounding box type).

If you have a reference to an element with an `id`, you might expect that you could serialize it, and then reload the page, and have it still refer to “that” element. But “that” element doesnʼt exist anymore. Maybe thereʼs a new one with the same `id` – maybe the `id` doesnʼt exist anymore! Well, it is sort of the best marker of intent there is.

And so we come to the conclusion that we will always be struggling with the API boundaries of runtimes, of data that isnʼt constructed from within the language itself. Once you have data inside your system that references things outside, how do you deal with it? What kind of guarantees could you still get when persisting it?

It is worth noting that NodeJS debugging facilities finally implemented support for detecting cyclic references when printing out structures and letting you know what the reference is.

Itʼs a simple thing, but itʼs a barrier at which most people throw up their hands.

Itʼs also funny that when debugging things interactively, via point and click stuff, lazily expanding, you donʼt care too much whether the data structure is cyclic or shared: nothing will blow up since it does not recurse automatically.

You can imagine an algorithm (well, I could write it if I wasnʼt sleepy) that pickles JSON-like objects, but in a way that respects mutability and sharing. It would write out a function definition to reconstruct a new mutable object equivalent to the one it was given.

In JavaScript you would have to do this the slow way. You would maintain a list of **all** mutable objects you have seen, along with where you saw them as a path down from the root object. You would then output code that reconstructs an object incrementally, by adding properties in order, and grabbing shared references from other parts of the object as necessary (or caching them in variables).

The cool thing is that you can do it all with `const`! You donʼt need mutability at the variable level, you can (and should) do it all at the mutable value level.

The algorithm I described can do this for data that is simple in structure, but complicated in terms of references, and with extensions it could handle more things (regular expressions would be easy to add, `undefined` would be trivial). Actually, it is funny – it would also need to be extended to handle sparse arrays, and all of these little details tell you how simplified JSON is from the actual data model of JavaScript.

This would give you your own faithful, accurate slice of the runtime heap as viewed from the perspective of one objectʼs watershed of references. The resulting reconstructed value would behave the same with regards to mutability of its children. It just would not compare equal with `===`, since it is a newly allocated value (and all of its children are too).

However, if nobody else remembered the old object, and you substituted in the new object very sneakily … nobody would know 🤫
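Here is a minimal sketch of that pickling algorithm for JSON-like data (plain objects, arrays, and primitive leaves), handling shared and cyclic references. All the names are mine, and the extensions mentioned above (`undefined`, regexes, sparse arrays) are left out:

```javascript
// Sketch pickler: emits JavaScript source that rebuilds an
// equivalent object graph, preserving sharing and cycles.
function pickle(root) {
  const seen = new Map(); // object reference -> emitted variable name
  const lines = [];
  function walk(value) {
    if (value === null || typeof value !== "object") {
      return JSON.stringify(value); // pure leaf: inline its literal
    }
    if (seen.has(value)) return seen.get(value); // shared or cyclic: reuse
    const name = "v" + seen.size;
    seen.set(value, name);
    // Allocate the empty shell first, so descendants can refer back to it.
    lines.push(`const ${name} = ${Array.isArray(value) ? "[]" : "{}"};`);
    for (const [key, child] of Object.entries(value)) {
      lines.push(`${name}[${JSON.stringify(key)}] = ${walk(child)};`);
    }
    return name;
  }
  const result = walk(root);
  return `(() => {\n${lines.join("\n")}\nreturn ${result};\n})()`;
}

// Cyclic, shared data that JSON.stringify would reject:
const shared = [];
const original = { a: shared, b: shared };
original.self = original;

const copy = eval(pickle(original)); // eval is just for the demo
console.log(copy.self === copy); // true: cycle preserved
console.log(copy.a === copy.b);  // true: sharing preserved
console.log(copy === original);  // false: newly allocated
```

Note the ordering trick: each object is first emitted as an empty `[]` or `{}` and registered, then filled in by mutation, so back-references to any (grand)parent already have a variable to point at.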

Python is pretty similar to JavaScript, in the rough kinds of mutable data it supports, but worth talking about separately.

It has more explicit boundaries around mutability and immutability in its data types. (Although still not as nice as Haskell. And I suppose JavaScript has been getting a little more in the way of immutability and actual data types.)

Python also provides the pickling library that is one of the main subjects of this article. More on this later.

Some wrinkles:

The fact that the hashing function rotates each run is really interesting! It technically is an observable difference between runs, but it isnʼt some essential semantic feature of the data. And you would have the same sort of thing if you are allowed to see an ordering on pointers.

External references, like files and sockets and other stuff – talked about elsewhere.

Regexes are really interesting. Theyʼre pointers to foreign objects (compiled regexes in some library implementation). But they can be reconstituted into equivalent objects very easily.

Itʼs worth getting to know the `pickle` moduleʼs capabilities and limitations, so I will just copy and paste the juicy bits here:

The `pickle` module keeps track of the objects it has already serialized, so that later references to the same object won’t be serialized again. […] This has implications both for recursive objects and object sharing. Recursive objects are objects that contain references to themselves. […] Object sharing happens when there are multiple references to the same object in different places in the object hierarchy being serialized. `pickle` stores such objects only once, and ensures that all other references point to the master copy. Shared objects remain shared, which can be very important for mutable objects.

`pickle` can save and restore class instances transparently, however the class definition must be importable and live in the same module as when the object was stored.

Note that functions (built-in and user-defined) are pickled by fully qualified name, not by value. [2] This means that only the function name is pickled, along with the name of the containing module and classes. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised. [3]

Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class. If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class’s `__setstate__()` method.

This last quote raises an important point: there is some aspect of intent when restoring data to a new runtime. Just because you named the class or function the same, does not mean it is the same class or function! But it is a good marker of intent and worth preserving.

I actually donʼt know a whole lot about Go or Java.

But they need some structure, at least for garbage collection purposes!

“Heap layout”.

Basically every runtime *at least* needs to keep track of which piece of data is a pointer or not.

dunno?

Uh, yeah. Good luck.

Pointers “are” numbers? What the fuck?!

Clearly thereʼs nothing much we can say about coherent semantics … without getting really deep into the weeds of what is and isnʼt undefined behavior and why.

However, it does reinforce the point: at a very very basic level, OSes and memory management and stuff are about managing the graph of live pointers – it is just very very hard to determine what bytes are actually live pointers at any given point in a C program, and what bytes are other kinds of data.

Yeah, it gets its own section and backstory!!

Nasal is a small embedded scripting language. Its name stands for “Not another scripting language”.^{13} Its only notable use is in the FlightGear open-source flight simulator, although AlgoScore and a tiny handful of other tiny projects use it.

Its data model is basically JavaScript, but simpler and better. (Arrays are their own data type, are not allowed to be sparse, and you can actually iterate over collections in a sensible manner. Good riddance to Lua and JavaScript. Ugh.)

It has some metaprogramming facilities by default, plus I prototyped some more of my own, including full bytecode decompilation.

Finally it has this one special function: `id(obj)`. It returns a string representation of the (stable) pointer for any object!

```
>>> id([])
'vec:0x7fea11014c40'
```

I mean, I guess it is like the `id()` function in Python … Yeah, both use mark/sweep GCs, so pointers are stable.

Anyways, the other great thing about Nasal is that objects donʼt have constructors! It is so liberating.

Pickling consists of writing out a file that, when executed, returns an equivalent object. (The body of a file is simply a function. Plus all statements are expressions – they have a return value, although sometimes it is a pretty useless return value.)

- You initialize a hashmap of object ids that have been seen.
- For each object you see, you look at the hashmap:
  - If it exists, you insert code to reference the existing variable and stop walking the structure.
  - If not, you add the reference to the hashmap, and add a variable to the file to save the reference in case you need it later.
- For non-recursive data, you just set it directly.
- For recursive data (objects and arrays), to handle cyclic references you may need to initialize it to an empty value and add items via mutation. Thus when those items mention their (grand)parent, that reference already exists in a variable, and the rest of the structure will continue to be built as necessary.

In lieu of a hashmap, you could even use a list of objects, and traverse it in linear time to compare referential identity. This is possible in most dynamic languages, just really slow.

I also tried to work on a bootstrapping system for Nasal. I never completed it.

Anyways, this is relevant because I did add bytecode decompilation for functions. You could already inspect bound scopes of closures, and callers and local scopes of the call stack. All that was left was builtins (which donʼt have bytecode).

If you have a contract with the bootstrapping system, you could look up the globally accessible names for the builtin functions that you could not decompile, and hopefully assume that equivalent builtin functions would live in those same spots on the next bootstrapping too.

The problem, still, is builtin external references, like files and such. Some could be supported on a case-by-case basis, but not everything.

Also, ideally you would airgap^{14} the builtins, since some builtins are fake builtins. That is, they are wrappers over actual builtin functions, but you could only access those builtins through the closure of the wrapped function – so you might as well stop bytecode decompilation for the wrapped functions (by wrapping the decompiler!) and treat them as builtins.

Anyways, in theory you would almost be able to save the entire state of a running Nasal system and fully reconstitute it under an equivalent bootstrapped environment. At the very least, you would expect to be able to save complicated data with simple functions.

The pickling process points towards a method of determining equivalence. Obviously you should sort the file in some semantic way, and rename variables to less arbitrary things. Maybe normalize the bytecode for functions.

After that, you should just be able to compare your resulting files and use that as a notion of equivalence!

Alternatively, you could write out a direct algorithm: take two objects at runtime and walk them recursively with the same kind of hashmap trick, comparing at which paths you see the objects, and then just make sure the shared references appear at the same minimal-paths from the root across their sharings. (You want to avoid cyclic recursions, of course, which does mean you will only look at minimal paths.)

Iʼll call this … “graph equality”? “Stable equality”? It is what Iʼve meant by “equivalence” all along.
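Here is a sketch of that direct algorithm in JavaScript (the function name `graphEqual` and the path encoding are my own choices for illustration):

```javascript
// “Graph equality”: two values are equal when their shapes
// match and shared references first appear at the same
// (minimal) paths on both sides, regardless of identity.
function graphEqual(x, y, path = "", seenX = new Map(), seenY = new Map()) {
  const xIsObj = x !== null && typeof x === "object";
  const yIsObj = y !== null && typeof y === "object";
  if (!xIsObj || !yIsObj) return !xIsObj && !yIsObj && x === y;
  const px = seenX.get(x), py = seenY.get(y);
  // Seen at least one before: both must have been seen,
  // and at the same path (also cuts off cyclic recursion).
  if (px !== undefined || py !== undefined) return px === py;
  // Seen neither: record the minimal path, separately per side.
  seenX.set(x, path);
  seenY.set(y, path);
  if (Array.isArray(x) !== Array.isArray(y)) return false;
  const keysX = Object.keys(x), keysY = Object.keys(y);
  if (keysX.length !== keysY.length) return false;
  for (const key of keysX) {
    if (!Object.prototype.hasOwnProperty.call(y, key)) return false;
    if (!graphEqual(x[key], y[key], path + "/" + key, seenX, seenY)) {
      return false;
    }
  }
  return true;
}
```

Two values pass exactly when every shared reference is first encountered at the same path on both sides – referential identity is abstracted away, but sharing structure is not.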

Nasal has a concept of “ghost” objects which are pointers to foreign objects, literally just C pointers with some associated data to help the garbage collector.

These are constructed by C APIs, and the only way to reconstruct them would be if you can call those APIs again to produce equivalent objects – which may not always be possible.

One of the main foreign interfaces of Nasal is FlightGearʼs property tree – a central store of important simulator values that is accessible from all of FlightGearʼs subsystems, not just Nasal. References to these nodes can be stored (making use of ghost objects) and manipulated through an API, in a way that is somewhat like accessing the DOM in JavaScript.

This is one type of ghost reference that could easily be handled by a pickling script: since the path of the node is available, you just have to serialize that, and then obtain the node again when deserializing. However, there are still some rough edges: what if the node doesnʼt exist when it is getting loaded again? It could possibly be recreated, but now the deserialization has side effects, which is weird. Or a separate property tree could be created to sandbox the script, and relevant nodes created there.
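As a sketch of handling such a ghost reference by intent, here is a toy in-memory property tree in JavaScript. The `PropTree` API is hypothetical, only loosely modeled on FlightGearʼs; the point is just serializing a path and re-resolving (or recreating) the node:

```javascript
// A toy in-memory property tree (hypothetical API, loosely
// modeled on FlightGear's property tree).
class PropTree {
  constructor() { this.nodes = new Map(); }
  getNode(path, create = false) {
    if (!this.nodes.has(path)) {
      if (!create) return null;
      this.nodes.set(path, { path, value: null });
    }
    return this.nodes.get(path);
  }
}

// Pickle a node handle by *intent* – its path – not by value.
const pickleNode = (node) => ({ ghost: "property-node", path: node.path });

// Unpickling re-resolves the path. Recreating a missing node
// is a deserialization side effect – the weird part.
const unpickleNode = (tree, pickled) =>
  tree.getNode(pickled.path, /* create */ true);
```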

Property trees can also be saved to XML and loaded from there, although there are various details that donʼt translate well regardless of serialization format. One example is properties that are managed by C++ code, instead of having their data be managed by the property tree – but those properties typically exist by the time Nasal is initialized.

So thereʼs always some details that need to be figured out or approximated when dealing with external APIs and data that is not managed in the language itself.

Nasal strings were actually a fun challenge: Nasal has mutable strings!

It has immutable interned strings, which are cached in a global lookup table. This is used for identifiers (including object keys), to speed up comparisons.

It also has mutable strings (and I believe they can be mutated to be immutable? it is a little weird).

The referential identity of strings is not exposed – the equality operator ignores it. However, you can still determine whether two mutable strings are the same reference, by using test mutation: if you mutate one and the other stays the same, they are different references.

(You can even use missed fast comparisons to determine if a string is interned or not.)
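The test-mutation trick translates to any mutable container. JavaScript already exposes identity via `===`, but pretending we only have reads and writes, it looks like this (illustrative only, using arrays where Nasal would use string mutation):

```javascript
// Determine whether two mutable arrays are the same reference
// using only reads and a (reversible) test mutation.
function sameReference(a, b) {
  // Append a unique sentinel to one array...
  const sentinel = Symbol("probe");
  a.push(sentinel);
  // ...and check whether the other one saw the change.
  const same = b[b.length - 1] === sentinel;
  // Undo the test mutation.
  a.pop();
  return same;
}
```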

The main thing I want you to take away is that dynamic runtimes donʼt have to be scary places filled with spaghetti data flying around all over the place. Itʼs actually possible to tame mutable references in many ways!

The concrete lesson is that there are three useful notions of equivalence, used for characterizing what level of abstraction of “data” we want to be looking at, and I think the middle one is much more important than we give it credit for:

- Referential equality, which treats everything as live, mutable references, but is too fine-grained and doesnʼt make sense across restarting the program. This is the notion of equality that your runtime (and particularly its garbage collector) is tasked with preserving for your code.
- My new notion of equivalence, which I will call “graph equality of mutable data”, which keeps track of shared mutable references and so on.
- The notion of “deep equality” of objects, which treats them mostly as if they are immutable data. (I didnʼt talk about it at all, whoops, but I assume you are familiar with it.) It can be very useful, but it acts a lot like traditional serialization, and isnʼt comprehensive enough to actually probe the whole of a running system.

So while referential equality forms your basic data model of a language, I encourage thinking about equivalence. **If you could swap out two objects completely** (including *existing* references to them), **would you be able to notice?**

And then you need to keep abstracting away what you care about. Do you care about exact hash values? Do you store those hash values in a way that would make reconstruction fail to mean the same thing with a different random seed? And so on.

So thereʼs still domain-specific work to be done, as there always is.

But if we can expose the underlying graph of referential relationships, we have a much much MUCH larger toolbox for working with data and data serialization.

Is (shallow) referential equality the best we can do? What about deep (explicitly non-referential) equality?

Emphatically no – but it is probably not worth it.

See, we could have the notion of equality that pickling and unpickling preserve. Graph equality, where the sharing of mutable references is tracked and tabulated by their role, instead of exact referential identity.^{15}

If you have two data graphs where pointers are shared in equivalent ways, sure, they could totally be considered parallel universes and interchangeable amongst themselves. (Obviously if something external holds references to them and you donʼt have a way to swap them out, this can break.)

The only problem is that it is pretty expensive, requiring a *lot* of bookkeeping, and most people generally donʼt care – they are fine either writing the equality comparison they need, or settling for the standard deep equality.

However, it is very useful!

Like, as one example, this is literally what stable names are used for.

Runtimes are literally built on graphs!

We want to be able to touch this. To expose it, to hold it in our hands. To work with it, to meld it to our own needs. To chart our own course through the graph, traversing references and recording where weʼve already visited.

We can have very nice things if we give up the dichotomy of referential identity versus deep equality, and embrace the graph nature of runtimes.

As part of these musings on data, I subscribe to the idea that the only objects that should have constructors (at the level of data – obviously client code will want different abstractions) are objects that are constructed from external references, i.e. FFI.

Idk, constructors in the sense of mainstream OOP are mostly a distraction for this view of data I want to talk about. They just arenʼt good, arenʼt necessary, they get in the way – especially since the arguments to the constructors donʼt have to correspond to the runtime data of the object at all.

Miscellaneous thoughts that donʼt belong in the conclusion but do belong at the end.

You really should not be able to ask for two references to be ordered against each other: it doesnʼt mean anything with regards to the *meaning* of the program (although it may record some historical data about how the program *did* happen to run). But you kind of should be able to put them into a map efficiently, and ordering/hashing is potentially important for that, but only if it can be stable.

Mark/sweep GC is good for stability of pointers (and thus comparisons). I think mark/compact can still preserve comparisons.

Weak references are interesting. Every time Iʼve wanted weak references, Iʼve always actually wanted to do them as reverse references: data stored on the key object, instead of in a map indexed by the key object. (Of course this may leak memory if the map you want to store is not long-term/global.)
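JavaScriptʼs `WeakMap` packages up exactly this reverse-reference idea: the entry conceptually lives on the key object, so it does not keep the key alive. A small comparison sketch (the `tag*` helper names are made up):

```javascript
// Map-style association: a global strong Map keeps every key
// alive forever – the leak that weak references try to avoid.
const strongMeta = new Map();
function tagStrong(obj, info) { strongMeta.set(obj, info); }

// WeakMap behaves like the “reverse reference” idea: the entry
// conceptually lives on the key object, and vanishes with it.
const weakMeta = new WeakMap();
function tagWeak(obj, info) { weakMeta.set(obj, info); }

// The literal reverse reference: store the metadata on the
// object itself, under a Symbol so it cannot collide with
// (or show up among) the object's real keys.
const META = Symbol("meta");
function tagDirect(obj, info) { obj[META] = info; }
function readDirect(obj) { return obj[META]; }
```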

I think itʼs instructive to pin down a model of garbage collectable data in Haskell/PureScript, where we can talk about references separately from pure data structures.

This is enough to model the fragment of JavaScript values I said should be covered by the pickling function I sketched. (Well, you could easily add `undefined`.)

```
-- The managed heap for the runtime data.
-- I believe this is what rustaceans
-- call an arena?
newtype Heap s metadata shape = Heap
  (MVector s (RuntimeData shape, metadata))

data RuntimeData shape
  -- Runtime data is given by a pure shape
  -- (which needs to be `Traversable`!)
  -- which contains runtime references
  -- in a known way
  = RuntimeData (shape RuntimeRef)
  -- It can also be an external “ghost”
  -- reference that we have a function
  -- to destruct (or dereference, if
  -- it is shared data)
  | ExternalGhost (Ptr ()) (IO ())

-- A managed reference we control,
-- thus it is an opaque pointer
-- into the opaque memory heap
data RuntimeRef
  = ManagedOpaque Int
  deriving (Eq, Ord)
  -- ^ the user is allowed `Eq`
  -- but not `Ord`

-- An example shape for mutable data
-- in the spirit of JSON (the only
-- reason it is not JSON is that JSON
-- is immutable, being a serialization
-- format, strictly speaking)

-- A JSON value, either a plain value
-- or a reference to a mutable value
data JSONValue ref
  = Null
  | Number Scientific
  | String Text
  | ByRef ref
  deriving (Eq, Ord, Functor, Foldable, Traversable)

-- What data lies behind a mutable value?
data JSONShape ref
  = Array [JSONValue ref]
  | Object [(Text, JSONValue ref)]
  deriving (Eq, Ord, Functor, Foldable, Traversable)

-- If we take the immediate fixpoint,
-- without mutable references in the
-- loop, we get plain immutable JSON data,
-- except that it is lazy, so it is
-- potentially infinite
newtype JSON = JSON (JSONValue JSON)

-- A path segment into an array or object
data Idx
  = ArrayIdx Int
  | ObjectIdx Text
  deriving (Eq, Ord)

-- A machine for creating a graph in
-- the mutable JSON structure
data Machine ref
  = SetKey ref Text (JSONValue ref) (Machine ref)
  | Push ref (JSONValue ref) (Machine ref)
  | Get ref Idx (JSONValue ref -> Machine ref)
  | NewArray (ref -> Machine ref)
  | NewObject (ref -> Machine ref)
  | Return (JSONValue ref)
```

Now we can talk about the concepts from above.

```
equal :: Eq ref =>
  JSONValue ref ->
  JSONValue ref ->
  Bool
equal = (==)

-- Since Haskell is lazy,
-- JSON is a greatest fixpoint,
-- so, with some care, I believe
-- you could even reify recursive
-- data into the JSON type
-- (but `Eq` would not terminate)
snapshot :: Monad m =>
  (ref -> m (JSONShape ref)) ->
  JSONValue ref -> m JSON

deepEq :: Monad m =>
  (ref -> m (JSONShape ref)) ->
  JSONValue ref ->
  JSONValue ref ->
  m Bool
deepEq read x y = do
  m <- snapshot read x
  n <- snapshot read y
  -- compare as JSON
  pure (m == n)

equivalent :: (Ord ref, Monad m) =>
  (ref -> m (JSONShape ref)) ->
  JSONValue ref ->
  JSONValue ref ->
  m Bool
equivalent read l0 r0 =
  evalStateT (comparing [] (l0, r0)) (empty, empty)
  where
  comparing ::
    [Idx] ->
    (JSONValue ref, JSONValue ref) ->
    StateT (Map ref [Idx], Map ref [Idx]) m Bool
  comparing path (l, r) =
    -- Try to match up the values
    case zipMatch l r of
      -- They are both references
      Just (ByRef (ll, rr)) -> do
        -- First we check if we should short circuit
        -- if they have been seen before
        (seenL, seenR) <- get
        case (lookup ll seenL, lookup rr seenR) of
          -- Seen both
          (Just p1, Just p2) ->
            -- We have to have seen them at the same path,
            -- since we are traversing in the same order
            pure (p1 == p2)
          -- Seen neither
          (Nothing, Nothing) -> do
            -- Keep track of where we saw this
            -- reference, separately for left and right
            -- (since it is valid for both to use the
            -- same reference for different purposes)
            modify
              (insert ll path *** insert rr path)
            -- Read the current values of the reference
            x <- lift (read ll)
            y <- lift (read rr)
            -- Try to match them up
            case zipMatch x y of
              -- Failed: different types
              Nothing -> pure False
              -- Similar arrays: recurse into children
              Just (Array xys) ->
                allM (uncurry comparing)
                  -- Add the index onto the path
                  (intoArray path <$> enumerate xys)
              -- Similar objects: recurse into children
              Just (Object xys) ->
                allM (uncurry comparing)
                  -- Add the key onto the path
                  (intoObject path <$> xys)
          -- Failed: seen one but not the other
          _ -> pure False
      -- Succeed if it is a pure value
      -- that is equal on both sides
      lr -> pure (isJust lr)
  intoArray path = first $ \idx -> ArrayIdx idx : path
  intoObject path = first $ \key -> ObjectIdx key : path
```

Hopefully you can see from the implementation of the `equivalent` function how two distinct references can still be interchangeable for all intents and purposes. We could write out this interchange formally, since all the references are visible on the heap, and then state some theorems about some functions.

However, for other processing, we donʼt even really need this arena/managed heap. We can use the Haskell runtime itself!

I havenʼt worked out the details (stable names?), but we should be able to reify the graph of a cyclic `JSON` value too.

The main difference is that, in Haskell, the infinite `JSON` could be truly infinite (like, procedurally generated) – it does not need to be backed by a finite amount of data like it would be in JavaScript.

If you go far enough down the rabbit hole, it turns out that you want semirings for static analysis. This is not unheard of in compilers! Itʼs a really good technique for analyzing control flow, for example: information about exclusive branches is combined additively, and information from sequential operations is combined multiplicatively. It is especially appropriate because, semantically speaking, you want those sequential operations to distribute over the branching.

(You can already see this in more typical typeclasses like `Alternative`, which is naturally analyzed by taking `<*>` to `*` and `<|>` to `+`. Itʼs just that Iʼm interested in augmenting `Selective` to encode exclusive choice too.)

This led me to come up with this construction: how to make a semiring out of a semilattice.

This construction answers the question, “if you need a semiring for static analysis, how do you also keep other data around that does not care about the branching structure?” (like, say, a monoid).

Specifically in selective applicative parsers, I need it to answer the question of why aggregating information about the grammar is a valid thing to do across parser combinator segments, no matter how they are combined.

And in the compiler pass I was doing, I was implementing demand analysis via semirings (especially the min tropical semiring). I actually donʼt have specific information I was considering aggregating as a semilattice, but it was a possibility that might come up, especially if I want to fuse some passes together. Right now my one pass is really three traversals of the tree, with various monad stacks of reader, writer, and state. (Yes I used all three.)

I don’t know if this construction is well-known! Let me know if you have a reference for it.

You can make a semiring out of a semilattice by adjoining a new zero element. Lifting the semilattice operation in the two obvious ways gives you `+` and `*`. Idempotence gives distributivity(!).

(Bounded) semilattices are commutative monoids whose operation is also idempotent: $x \diamond x = x$ for all $x$.

I will write the monoid operation as `x <> y` and as $x \diamond y$, and the empty element as `mempty` or $e$.

Semilattices have deep connections to order theory: they induce a really nice preorder given by $x \leq y$ when $x \diamond y = x$ (or vice-versa, depending on whether you are talking about meet or join semilattices – and no, I cannot keep them straight 🏳️🌈). But we donʼt need the order theory here.

Semirings are rings without subtraction: just addition and multiplication and their identities, zero and one, respectively. And distributivity and annihilation laws to intertwine these two monoids.

The funny part of this is that “semi-” means different things: semirings are just missing subtraction (kind of a weird use of semi, which is why some call them rigs), but semilattices are literally half of a lattice (one idempotent commutative monoid instead of two interlinked).

(Lattices are actually closely related to semirings: they have the same shape of operations, and you can turn every bounded *distributive* lattice into a semiring – in two ways, in fact, since you can make a lattice with the opposite order.)

So itʼs like a mathematical joke that they can be related to each other at all!

How do we get two monoids out of one??

The key idea is to adjoin a zero. Thatʼs it.

The rest of the moves can be determined from that premise, so letʼs see how it works:

```
data WithZero t = Zero | NonZero t

-- Imagine that `t` is really a `Semilattice`
-- (this does not exist as a class in PureScript)
instance Monoid t => Semiring (WithZero t) where
  zero = Zero
  one = NonZero mempty
  add Zero Zero = Zero
  add Zero r = r
  add l Zero = l
  add (NonZero l) (NonZero r) = NonZero (l <> r)
  mul (NonZero l) (NonZero r) = NonZero (l <> r)
  mul _ _ = Zero
```

The two operations are the semilattice operation lifted through `Maybe` in the two possible ways:

- `add` follows the pattern of the default `Semigroup a => Monoid (Maybe a)` instance, which uses `Nothing` as its identity. This makes sense since weʼre adding `Zero`, the identity for `add`. Indeed, it is forced by the laws, and the fact that we only have one binary operation to use.
- `mul` is like the other possible instance, `Monoid a => Monoid (App Maybe a)`, formed by `(<>) = lift2 (<>)`. This is likewise forced by the annihilation law and the fact that we only have one binary operation to use.

Iʼm going to be lazy and use math notation for the laws, with the understanding that when I say $x \neq 0$ for example, it means in Haskell/PureScript that `x = NonZero x' :: WithZero t` for some unique `x' :: t`, and if $x, y \neq 0$ then $x \diamond y$ means `NonZero (x' <> y')`.

The fun part is the left and right distributivity laws:

To prove left distributivity, $x * (y + z) = x * y + x * z$, we look at some cases:

- If $x = 0$, then we have $0 * (y + z) = 0 = 0 * y + 0 * z$.
- If $y = 0$, then we have $x * (0 + z) = x * z = x * 0 + x * z$.
- If $z = 0$, then we have $x * (y + 0) = x * y = x * y + x * 0$ similarly.
- So now we can assume that all three variables are nonzero. But that means we fall back to the underlying semilattice operation: $x \diamond (y \diamond z) = (x \diamond y) \diamond (x \diamond z).$ But by commutativity and associativity, $(x \diamond y) \diamond (x \diamond z) = (x \diamond x) \diamond (y \diamond z).$ And finally we finish off with idempotence: $x \diamond (y \diamond z).$

We prove right distributivity, $(x + y) * z = x * z + y * z$, in the same way, based on the calculation $(x \diamond z) \diamond (y \diamond z) = (x \diamond y) \diamond (z \diamond z).$

The takeaway is that **idempotence of the semilattice gives us distributivity of the semiring**. This is why having a semilattice and not merely a monoid is essential.

This does make some sense: if weʼre aggregating information that does not care about branching structure at all, well, semilattices are great models for accumulating knowledge. Idempotence says you only learn a fact once.
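To spot-check the construction concretely, here is `WithZero` over the set-union semilattice transliterated to JavaScript (sorted arrays of “facts” stand in for sets; this is just a test harness for the laws, not the PureScript definition above):

```javascript
// `null` plays Zero; a sorted array of “facts” plays NonZero
// over the set-union semilattice.
const Zero = null;
const union = (l, r) => [...new Set([...l, ...r])].sort();

// add: Zero is the identity (the Maybe-style lifting)
const add = (l, r) => (l === Zero ? r : r === Zero ? l : union(l, r));
// mul: Zero annihilates (the Applicative-style lifting)
const mul = (l, r) => (l === Zero || r === Zero ? Zero : union(l, r));
const one = []; // the semilattice identity: no facts yet

const eq = (l, r) => JSON.stringify(l) === JSON.stringify(r);

// Spot-check left distributivity x*(y+z) = x*y + x*z:
const samples = [Zero, [], ["a"], ["b"], ["a", "b"]];
for (const x of samples)
  for (const y of samples)
    for (const z of samples)
      if (!eq(mul(x, add(y, z)), add(mul(x, y), mul(x, z))))
        throw new Error("distributivity failed");
```

(Sorting keeps the array representation canonical, so `eq` really is set equality – and the distributivity check succeeds precisely because union is idempotent.)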

We donʼt require multiplication to be commutative, so if you drop the left-distributivity law, you could get away with a right-regular band with the law $x \diamond y \diamond x = y \diamond x$.

I think left-distributivity is a bit weirder than right-distributivity, in the context of control flow. Right distributivity just says you can copy any trailing code into each case branch.

However, in general Iʼm a fan of left-regular bands, since they intuitively preserve order.

Also, to be fair, you could absolutely disregard some semiring laws for the sake of static analysis of programs: you donʼt always want to treat programs as purely algebraic structures, and often want to dig into the details of how they were constructed.

Like, if youʼve factored out common control flow, thatʼs almost always for a reason! So your static analysis should reflect that.

We made this true by definition of `mul`: $0 * x = 0 = x * 0$.

We also made $0 + x = x = x + 0$ true by definition of `add`.

So we just need to prove that $x + (y + z) = (x + y) + z$ for $x, y, z \neq 0$. But that follows from the semilatticeʼs associativity: $x \diamond (y \diamond z) = (x \diamond y) \diamond z$.

Commutativity of addition? Yes – it comes directly from the semilatticeʼs commutativity (and $0$ commutes with everything by the definition of `add`).

For $1 * x = x = x * 1$, we need two cases: $x = 0$ and $x \neq 0$. But if $x = 0$, it is trivial still. (This is the nice way the identities and annihilator elements interact: neither adds any proof burden to the other.)

So for $x \neq 0$ (and since $1 \neq 0$ and $1$ is given by the semilatticeʼs identity $e$), we look at the underlying semilattice and find that $e \diamond x = x = x \diamond e$ as we want.

Same case analysis as usual: if $x, y, z \neq 0$ then we get associativity from the semilattice, otherwise both sides equal $0$ by the power of the annihilator.

Multiplication is even commutative here, since the semilattice operation is – a bonus, given that semirings do not require it.

Note that we cannot make a lattice out of the semilattice – thatʼs a step too far. Intuitively, from the order theory point of view, thereʼs no reason why we would be able to, since the meet and join operations of a lattice have opposite views of the preorder of the lattice.

And algebraically, the two absorption laws would fail in general: $x * (x + y) = x$ and $x + (x * y) = x$ (even stating them like that looks weird). For $x \neq 0$, by idempotence of the semilattice we would see $x \diamond (x \diamond y) = x \diamond y$, which only equals $x$ if $y = e$. Thereʼs just no way to get rid of the extra $y$ there if we are sticking to one operation.

You could technically iterate this construction, since `add` and `mul` are both idempotent, commutative, associative operations now. However, itʼs not terribly interesting.

You end up adjoining some number of identities and annihilators to the underlying semilattice. (New top/bottom elements, depending on which way you look at it.) The order that you do this in does not matter, only how many times you choose to do each.

Want a semiring without zero? No need to adjoin a zero, then – just use the same carrier type. The remaining laws still just work.

For static analysis, the zero is only good for representing unreachable/error cases. But the identity of the semilattice is indispensable: itʼs the empty analysis for when you know nothing yet or have nothing to contribute.

Itʼs important to note that all of these algebraic constructs (monoids, semilattices, semirings^{1}) are closed under taking products. This is why I said “how do you *also* keep other data around” in the introduction.

The concept of a semiring is an abstract conception of what a number is. A particular semiring is a specific conception of what can be a number. We can manipulate these “numbers” in the familiar ways – mostly.

Iʼve been dreaming of making my own metalanguage for writing type theories for many years now. I havenʼt implemented much yet, but Iʼve been refining my ideas. Hereʼs a tiny taste:

*What you write:*

```
// If we are casing on a known boolean, we know which case to choose
normalize/ ["if" "true" "then" exprT "else" exprF] => exprT;
normalize/ ["if" "false" "then" exprT "else" exprF] => exprF;
// If both branches have the same value, the boolean is irrelevant
// This is an example of non-linear pattern matching, which will get desugared
normalize/ ["if" _cond "then" expr "else" expr] => expr;
// Fallback case, including most other nodes not specified
normalize/ layer => normalized:
// ^ We compute the result, `normalized`, by following these operation(s):
// Recursively calling normalize on each child node of this node
// (that's it, that's the only operation in this case, but there could be more)
map normalize layer => normalized
// ^ okay this line probably needs tweaking for type inference ...
```

*What it means:*

```
normalize (Roll (IfThenElse metadata cond exprT exprF)) =
case normalize cond of
-- Note that "true" above is allowed to stand for the literal in the AST
-- (as well as the Boolean type in tmTTmt itself), but in Haskell we need
-- an explicit constructor `BooleanLiteral` to embed it in the AST:
BooleanLiteral True -> exprT
BooleanLiteral False -> exprF
condN ->
let (exprTN, exprFN) = (normalize exprT, normalize exprF)
in case areEqual exprTN exprFN of
-- Every time we equate two expressions, especially though matching on
-- the same variable name twice, we return a unified node, so that we
-- have a chance to merge *metadata* when the underlying *data* is
-- the same. In typechecking, unifying can also unify metavariables,
-- through algebraic effects or something.
Just exprN ->
exprN
-- We fall back to the `map` operation, special cased for the node we
-- already matched here. We can pass the metadata through to the result,
-- or update it -- but how?? It will be specified in other places,
-- e.g. through typeclasses and auxiliary functions ...
Nothing ->
Roll (IfThenElse (updateMetadata?? metadata) condN exprTN exprFN)
normalize (Roll layer) = Roll (map normalize layer)
```

I want to skew to the left of many design choices that have been made for most languages, aiming for very specific tradeoffs to achieve aesthetics and functionality.

My goal is to have a syntax that is straightforward for humans to write and easy for computers to interpret.

But I donʼt want this process to be magic! I just want it to look convenient for writing powerful systems of type theory.

AAaaahhhh

My problem with existing programming languages is that they are too heavily tied to a fixed logic of computation: ADTs in Haskell are great (especially for language design!), at least right up until you need to add extra data to your ADTs, and now that infects your whole program with ugliness and bookkeeping.

In general, **this distinction between data and metadata** is so *crucially* important to me. And not something that is really considered in any typed programming language! Data has always gotten tied directly to the logic you are working in, and metadata was given no freedom to roam around. So letʼs unleash them. Let it loose :3

As a very concrete example of this, imagine caching which variables are present in each term (such as Director strings, or a simple extra `Set Name` on each node). This can be used to skip allocations when you know nothing is changing in operations that target variables. But now you need to keep track of that information literally everywhere you construct nodes in your AST! Despite it being a really simple algorithm to compute on its own, one that hopefully could be incrementally updated in most cases.
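To make that pain concrete, here is a hypothetical lambda-calculus AST in JavaScript where every constructor maintains the cached variable set (ignoring capture-avoidance). The bookkeeping infects every constructor, even though the payoff – sharing untouched subtrees during substitution – is real:

```javascript
// A hypothetical lambda-calculus AST where every node caches
// the set of variables occurring beneath it. Every constructor
// has to participate in the bookkeeping – that is the
// data/metadata entanglement being complained about.
const Var = (name) => ({ tag: "Var", name, vars: new Set([name]) });
const App = (fn, arg) => ({
  tag: "App", fn, arg,
  // The cache is cheap to maintain incrementally...
  vars: new Set([...fn.vars, ...arg.vars]),
});
const Lam = (name, body) => ({
  tag: "Lam", name, body,
  // ...but every node shape needs its own update rule.
  vars: new Set([...body.vars].filter((v) => v !== name)),
});

// The payoff: substitution (ignoring capture-avoidance!) can
// return entire subtrees unchanged and shared, allocating
// nothing, whenever the variable does not occur in them.
function substitute(node, name, replacement) {
  if (!node.vars.has(name)) return node; // untouched, shared
  switch (node.tag) {
    case "Var": return replacement;
    case "App":
      return App(
        substitute(node.fn, name, replacement),
        substitute(node.arg, name, replacement)
      );
    case "Lam":
      return Lam(node.name, substitute(node.body, name, replacement));
  }
}
```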

As another example, source spans are really tricky to keep around and get right. (I have thoughts on that – that we shouldnʼt be using source spans! – but thatʼs beside the point.) But imagine if you didnʼt have to do any work to keep them around: the logic of tmTTmt could keep that information around, and generate empty source spans for nodes that are inserted by the compiler. (With some way to optionally borrow source spans from other node(s).)

As a third example of why we should separate data and metadata: if we keep the identity of nodes separate from their raw data, we can keep track of which source terms interact with each other. The better you can keep track of source spans and *provenance*, the more reliable this system will be. If you keep track of which types are unified with each other, and can map them back to locations in the source, it could even tell you all of the places you need to edit if you want to change the type of something (or where you need to insert conversions).

If you arenʼt forced to work in the base logic of Haskell, and instead have more freedom to come up with a linguistics and logic of type theory design itself, youʼll get several benefits:

You donʼt have to rewrite your whole compiler to introduce optimizations, like I mentioned above.

You could generate graphical representations of the rules, from the exact same source that the program is derived from. This would be fantastic for interactive typechecking rules, which could enhance error messages with the particular rule that failed to be derivable, and allow you to search back through the parts of the derivation that did succeed.

You may still want separate sets of rules for conventional “paper” presentations, however. Like as a term rewriting system, instead of NbE, for example. But if both of them are executable, you can test whether they are equivalent! (With QuickCheck, unit tests, or assertions.)

Iʼve been ranting about tmTTmt on cohost as an outlet until I have the time to actually write the darn thing: @monoidmusician/#tmttmt.

The real genesis of the project was when I was attempting to write Dhall-PureScript (many apologies for dropping the ball on that one). I wanted to go all in on extensible row types and recursion schemes. I think theyʼre awesome tools, but they proved too impractical to use in PureScript, since they wrecked the existing pattern matching facilities of the core logic of PureScript. I have also learned a lot in the meantime (e.g. about parsing, about languages like Erlang) and developed a lot more ideas. I think Iʼm ready to make it happen!

- Simple syntax, which can be easily interpreted by other programs in other ways.
- Type-directed shenanigans. (Typeclasses, mostly. Also type-directed syntax sugar.)
- Algebraic effects, or something. This is necessary for lightweight unification.
- Compilation steps to get from that syntax to some core logic/runtime in some language.
- Desugaring nonlinear patterns into appropriate equality/unification steps.
- Desugaring core logic into monads/applicative(/selectives?)
- Inlining; removing redundant steps.
- In particular, it will be the expectation that these operations are safe for any custom monads/effects that users use. As an example, resolving imports in Dhall requires doing disk access and network requests, but those network requests are cached during resolving, so their resolution is idempotent and redundancies can safely be removed. (And obviously if there are URLs that do not have to be resolved, thatʼs great for efficiency! Although you can argue about safety, which is why these things need to be customizable.)

- Personally I think it would be fun to target PureScript, JavaScript, Erlang … very different needs across each of those.

- Functors! I love functors.
- Encourage healthy abstractions. I think thatʼs a great word: *healthy* abstractions.
- I have this idea for a type system and I donʼt know if it will pan out … Something like TypeScript done better (or similar sorts of ad-hoc type systems).
- Easy debugging and decent dev UX. Being able to dump terms in a representable/inspectable format. Being able to trace execution and focus logs. Flags to enable/disable features. Assertions. Idk.

- Lightweight literals, type-directed. A literal is a string or a list of literals or variables. (Basically reinventing lisp lol.)

Types are basically patterns of literals, meaning literal singleton types plus arrays and unions of types, like TypeScript but hopefully in a sensible way. Thus it is obvious what type a literal has, and then this can be subsumed by other types.

There are also nominal types; still figuring out the details there. The main goal is to mostly get rid of newtype wrappers, so you can just match on the constructors-as-literals you want through all the cruft. But type annotations will still be necessary to disambiguate types in some cases. And full type annotations are usually tedious, so some system of named coercions may be helpful.

In particular, by committing ourselves to *this* logic of **literals as ground truth for comparing** *across types*, we can generate automatic coercions between subsets of complex types.

I understand why a lot of languages want literals to have distinct types (e.g. Haskell ADTs all have distinct, named constructors), but it just poses a barrier to the fluidity I want to have in this system for language design of all things. If you name something `["if" _ "then" _ "else" _]` then you know what it represents! No matter if it is in the source CST, the desugared AST, or a final core pass …

In some target runtimes, if they are faithful to the literals, these will be actual zero-cost coercions. However, because the expectation is that compilation is type-directed and enough type information is available to insert conversions as necessary, there is no reason that they are required to be coercions in implementation.

Restriction types, of nominal types constrained to fewer possible cases, would be incredibly useful.

tl;dr is that this should help with the “trees that grow” problem of multiple related ASTs.

Iʼm wavering on including records: I think they donʼt mesh well with the system of inference. But there is an alternative, which is to include sort of “grab bags”: where you donʼt pretend to know the totality of the record (in particular, there is no sensible way to implement `Eq` for records), but you have some partial knowledge of what you want to be in there. In concrete terms, this means that inclusion in the grab bag is the only sensible constraint you get to ask for; you donʼt really get to “delete fields” or “merge” or such.

Avoiding row types … idk. Row types are great but I think there are enough alternatives in this type theory that they would not be so important. In particular, having union types (and maybe restriction types) means that you can talk about parts of the AST.

If I did have row types, I would want to make sure they are not limited to existing constructs of records and variants (product and sum types); there are so many other symmetric tensors to think about! E.g. configuration options tend to come as a record of maybes, but sometimes you need a bunch of things tensored together with `These`, so you know that at least one is specified.

- Function calls cannot be nested, functions are only applied to literals/variables.
- This is for two reasons: it makes the syntax lighter, and it means that the user was very specific about the order of execution.
- One concrete benefit is that you need much fewer delimiters in the syntax, since each pattern ends at a well-known point.

https://cohost.org/monoidmusician/post/3252802-first-class-patterns

We need a way to reflect patterns into values, filling in any variables with default values. This is most useful to implement unification: to unify a term with a pattern, you first replace the variables with unification variables, call the unification function (which has no idea what a pattern is), and then match the pattern against the result.

So if you want to unify `T` against `["tuple" x y]`, you first generate two unification variables `U1` and `U2`, then run `unify T ["tuple" U1 U2] => R` (if `T` is a unification variable, this will write into state that it is now known to be a tuple!), and finally do regular pattern matching of `R` against `["tuple" x y]`, binding `x` and `y` to the respective subnodes of `R`.

Iʼm not quite sure if this deserves to be called first-class patterns. To be honest, Iʼm not sure what first-class patterns would even mean! But it is very simple and practical, and it does all the things I personally would want out of first-class patterns.
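Here is a minimal sketch of that dance in Python (a hypothetical representation, nothing canonical: terms are nested lists, pattern variables are strings starting with `$`):

```python
class Var:
    """A unification variable (hypothetical representation)."""
    def __repr__(self):
        return f"?u{id(self) % 1000}"

def reflect(pattern):
    """Replace pattern variables ('$x') with fresh unification variables;
    the unifier itself never needs to know what a pattern is."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        return Var()
    if isinstance(pattern, list):
        return [reflect(p) for p in pattern]
    return pattern

def walk(t, subst):
    while isinstance(t, Var) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Structural unification, writing discoveries into `subst`."""
    a, b = walk(a, subst), walk(b, subst)
    if isinstance(a, Var):
        subst[a] = b
        return True
    if isinstance(b, Var):
        subst[b] = a
        return True
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        return all(unify(x, y, subst) for x, y in zip(a, b))
    return a == b

def resolve(t, subst):
    """Read a term back out of the unification state."""
    t = walk(t, subst)
    return [resolve(x, subst) for x in t] if isinstance(t, list) else t

def match(pattern, term, bindings):
    """Ordinary pattern matching, run after unification has succeeded."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        bindings[pattern[1:]] = term
        return True
    if isinstance(pattern, list):
        return (isinstance(term, list) and len(term) == len(pattern)
                and all(match(p, t, bindings) for p, t in zip(pattern, term)))
    return pattern == term
```

Usage mirrors the description above: `reflect` the pattern into a skeleton with fresh variables, `unify` against it, then `match` the original pattern against the resolved result to pick up the bindings.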

It is a technique I have also been using in more and more places: in my LR parser combinator framework, and in writing a compiler.

The basic idea is that a `Pattern` or `Matcher` (or whatever you want to call it) is a combination of the shape that it expects (minus actual data), and then what to do once it receives that shape (with the data filled in, producing some arbitrary result). You can combine these applicatively and with other tools (if you can combine the shapes and pick them back apart); it is very useful, even without any language support whatsoever, just DSLs and fancy types. These are basically codecs in a single direction (not bidirectional).
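A toy rendition of such a `Matcher` (names and API are mine, purely to illustrate the shape-plus-continuation idea):

```python
class Matcher:
    """A shape the matcher expects (holes as None), plus what to do with
    the data once the shape is filled in."""
    def __init__(self, shape, extract):
        self.shape = shape
        self.extract = extract

    def run(self, term):
        return self.extract(term) if fits(self.shape, term) else None

    def map(self, f):
        return Matcher(self.shape, lambda t: f(self.extract(t)))

def fits(shape, term):
    if shape is None:                  # a hole accepts anything
        return True
    if isinstance(shape, list):
        return (isinstance(term, list) and len(term) == len(shape)
                and all(fits(s, t) for s, t in zip(shape, term)))
    return shape == term

def lit(s):
    """Match exactly the literal `s`, producing it."""
    return Matcher(s, lambda t: t)

def seq(*matchers):
    """Applicative-style combination: the shapes combine into a list shape,
    and the results combine into a tuple of sub-results."""
    shape = [m.shape for m in matchers]
    return Matcher(shape, lambda t: tuple(
        m.extract(x) for m, x in zip(matchers, t)))

hole = Matcher(None, lambda t: t)      # match anything, produce it
```

The one-directional-codec flavour shows in `seq`: shapes are combined going in, results are picked back apart coming out.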

Non-linear pattern matching.

Static evaluation of functions, so they can be considered as macros producing patterns.

This means that they have to reduce to a pattern, without running any effectful functions, and cannot have stuck case matching, and so on.

Pattern aliases?

Writing down programs in the style of judgment rules for different flavours of typechecking (unification and bidirectional) and normalization (rewrite systems and normalization by evaluation).

Optimizing these algorithms by applying transformations from the source code to add ~things~. And using the same techniques to add additional metadata, for nice features.

This is my problem with a ton of code that gets written, and I am certainly guilty of it too: we get the fundamental logic written, but never get over the hump to the point of providing nice features, in part because the languages/libraries we use do not facilitate the nice features – heck, they even made writing the logic so arduous in the first place that we run out of steam – and partly because, as I have mentioned repeatedly, it would mean touching half of the codebase again just to add fields that none of the existing logic cares about.

Urgh. Letʼs find ways to do better, and create the tools that reflect our values.

… Anyways, back to those use-cases:

Trace evaluation of these programs and generate interactive visualizations based on this.

Generate types for precise errors based on analyzing failure modes of the written logic.

Working with multiple related AST types. Working with types related in other ways, such as non-empty constraints. (These get pretty onerous to work with, when you have to write completely separate types for non-empty things, and make sure you preserve non-emptiness in the right places. Trust me, Iʼve tried!)

Simplify writing advanced programming techniques:

- Continuation Passing Style (CPS). This is (apparently) really great for efficiency for a bunch of reasons (e.g. quicker backtracking), but can be mind-bending to write directly.
- Deriving zippers/one-hole contexts for datatypes, possibly even Clowns & Jokers style for incremental stack-safe computations. (One-hole contexts are possible to derive generically with typeclass machinery. But the conversions get super annoying…)
- Functional Reactive Programming (FRP). Existing FRP frameworks are … alright. But none really capture the right logic/linguistics to make it easy.
- Incremental computation. I mean … just imagine an incremental compiler, where trivial refactors donʼt cost any time, changing constants in the source code changes them directly in the compiled artefacts, and other tasks scale proportionally to the amount of things they actually affect.
- “Free” constructions (I mean, minus laws, since we donʼt have quotients). These are just so difficult to make, with a lot of boilerplate.
- Codecs. Parsing. I love parsers so it would be great to integrate them. Maybe even into the type theory! (It is apparently possible to algorithmically decide whether one regular expression is contained in another uwu :3.)
- STM. Eventual consistency. Other lattice-y stuff.
- Parallel evaluation, à la `unamb` or so.

- Not intended to support dependent types or any theorem proving features.
- This is not intended to be a logic language, although it could be compiled to a logic language. Thus we will not expect to be doing proof search during execution. (Arguably could be doing proof search during compilation.)
- Similarly: not interested in baking in unification. That can (and should) be provided by users; the goal is to make the syntax lightweight enough to facilitate it.
- Probably not going to have Rank-N types for a while, if ever. I mean, I like Rank-N types, especially for APIs, but most things end up being better expressed by inductive data types, and this way I have a type inference algorithm that is actually tractable …

(A separate section so I donʼt bury Non-goals)

Worship the shape of data and the structure of code …

- Any metatheory that makes dealing with variable binding easier is worth a lot!
- What I did in Dhall-PureScript: Dhall/Variables.purs. This just does basic bookkeeping of when variables are bound, based on the functors I used in the recursion schemes, but I think it proved to do most of what I needed.
The other, silly solution, is to commit to only having one binder: lambda, and phrasing pi in terms of lambda. I convinced myself it works out on paper but I got a little stuck trying to prove it to Agda. Heh heh heh …

- Container functors, the building blocks of an AST.
  - `traverseWithIndex :: (i -> a -> m b) -> (f a -> m (f b))`
  - `mergeWith :: (i -> a -> b -> c) -> f a -> f b -> Maybe (f c)`
  - I believe that we need a lot more binary operations like this, for matching on two shapes at once! It is not something that is covered by recursion schemes, for example. `Data.Map` has a terrible interface (`unionWith` is so bleh).
- Zippers/one-hole contexts (optional – I never actually used them in Dhall-PureScript, but they could be useful for some things):
  - `upZF :: ZF f' x -> f x`
  - `downZF :: f x -> f (ZF f' x)`
  - `ixF :: f' x -> i`
- Array stuff
  - normal `zipWith`
  - “long” `zipWith`
  - `takeWhileJustWithRest :: (a -> Maybe b) -> Array a -> (Array b, Array a)`
  - some kind of condensor pattern
  - something better than `mapAccumL`/`mapAccumR` lol
    - every time I want to reach for a stateful traversal, I find it so annoying!
  - maybe some actual parser type thing
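The `mergeWith` signature above is easy to sketch for a dictionary container. A minimal Python rendition (the name and behavior are my own, not from any library):

```python
# A sketch of `mergeWith` specialized to dicts: succeed only when the two
# containers have exactly the same shape (here: the same keys), combining
# values pointwise with access to the key/index.

def merge_with(f, fa, fb):
    """Return {k: f(k, fa[k], fb[k])} when fa and fb have identical keys,
    else None - matching on two shapes at once."""
    if fa.keys() != fb.keys():
        return None
    return {k: f(k, fa[k], fb[k]) for k in fa}
```

Contrast this with `unionWith`-style operations, which silently take the union of the keys instead of telling you the shapes disagreed.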

*~Disclaimer that I use typechecking and type inference interchangeably.~*

I think it is *very very* useful to move from thinking of unification as a binary operation to it as a N-ary operation. As one example, consider (homogeneous) list literals.

The way a lot of typecheckers work when inferring list literals is that they assume the first item has the right type, and then typecheck the remaining items against it. But what if it is the first item that has the wrong type, and all 12 other items are actually right? I believe it is best to typecheck each term in isolation, then see if the results can be unified all at once – and then unify the unification states, since unification variables may have been unified in inconsistent ways. (This requires unification state to be `WriterT`, not `StateT`. Yeah.)

```
typecheck/ ["ListLiteral" items] => ["App" "ListType" itemType]
map typecheck items => itemTypes
ensureConsistency itemTypes => itemType
```
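A toy illustration of the N-ary idea (not the actual typechecker; `infer` here is just Pythonʼs runtime type, standing in for real inference): typecheck each item in isolation, then reconcile all the results at once, so the error can blame the odd item out rather than whichever items happened to come after the first.

```python
from collections import Counter

def infer(item):
    """Stand-in for real type inference: just the runtime type name."""
    return type(item).__name__

def check_list_literal(items):
    # Typecheck every item in isolation first ...
    item_types = [infer(x) for x in items]
    counts = Counter(item_types)
    if len(counts) <= 1:
        return ("ok", item_types[0] if item_types else "unknown")
    # ... then reconcile all results at once: the minority items are blamed,
    # not everything after the first item.
    common, _ = counts.most_common(1)[0]
    bad = [i for i, t in enumerate(item_types) if t != common]
    return ("error", {"expected": common, "bad_indices": bad})
```

With a first-item-wins checker, `["x", 1, 2, 3]` would produce three errors blaming the integers; here the lone string at index 0 is blamed instead.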

I would like to be able to short-circuit typechecking non-dependent functions, and return a result even if the argument is ill-typed or does not have the correct type.

(Why? Because having a more global view of errors is often useful, since the hyperlocal errors we are used to can obscure the real problem.)

This would show up as a soft error that allows further typechecking to proceed. Soft errors can be turned into critical errors when we need to be able to trust the result of typechecking, e.g. to know that normalization is going to complete.

```
typecheck/ ["App" fn arg] => resultType:
// Unifies the result with a "Pi" type
typecheck fn => ["Pi" binder domain codomain]
// See if `codomain` does not in fact depend on `binder`
tryApplyConstant binder codomain
? ["constant" resultType]:
// `resultType` got assigned, so this case is not necessary to produce
// *some* result that can inform further type errors, though this node does
// not truly typecheck if it fails:
typecheck arg => domain
// `domain` is a non-linear pattern match, unifying `argType` and `domain`
// (any further references to `domain` would refer to the unified node)
? ["non-constant"]:
// Typecheck the argument in strict mode to ensure that type errors result
// in an immediate failure even if an approximate result can be computed:
strictly ([] => typecheck arg) => domain
// (Unification with `domain` is always strict, it never adds soft errors.)
// Now that it is safe to compute with `arg`, we apply it to compute the
// result type:
substitute binder arg codomain => resultType
!
// Probably should simplify this somehow ...
```

Is this good notation for lambdas as arguments to functions? I donʼt know.

```
strictly | [] => r:
typecheck arg => r
! => domain
```

Macros for currying?

`asdf (!2 append !1 !0 !)`

I want to avoid some problems:

- Indentation. Figuring out how to indent lambdas as arguments to functions is so annoying.
- Related: figuring out where the lambdas end is also annoying. I do like dangling lambdas actually.

`["if" ($matches-tag arg1) (: MyExprType) "then" "true" "else" ($failed-match)]`

```
-- The behavior of `select` for the typechecker monad/thingy is that if the
-- first computation returns a `Left`, it will accumulate errors from the second
-- computation, instead of failing and blocking computation like `>>=`.
--
-- In particular, it does accumulate errors from `strictly`, in that case.
select :: f (Either b a) -> f (a -> b) -> f b
strictly :: f a -> f a
tryApplyConstant :: Binder -> Type -> Maybe Type
typecheck :: Type -> f Type
typecheck (App fn arg) =
select
( typecheck fn >>= \fnType ->
unifyPi fnType >>= \binder domain codomain ->
case tryApplyConstant binder codomain of
Just r -> Left r
Nothing -> Right Unit
)
( strictly $ typecheck arg >>= \argType ->
unify argType domain <#> \_unified ->
apply binder arg codomain
)
```

Just copy the Elm version solver from Haskell to PureScript, itʼll be easy.

Uh huh. Totally.

Oh we need good errors too.

Yup. Thought so.^{1}

And so the feature creep started … but the journey was *so* worth it.

How did I get here and what did I come up with?

A novel algorithm for resolving dependency bounds to solved versions:

- Incorporates transitive dependency bounds for a breadth-first search:
- What dependencies are required no matter which package version in the range we commit to?
- Whatʼs the loosest bound for each dependency then?

- By taking this intuitive approach, we gain two things:
- Better errors, matching what users would expect.
- Efficiency too, if you could believe it.

- Implemented using semilattices (monoids).

(I know youʼre probably not going to read this whole long article and Errors is the very last section, but please feel free to skip ahead to that one since that was the whole point of this exercise!)

The PureScript community has been designing a new registry to hold PureScript packages for some time now. PureScript projects initially used Bower (a defunct npm competitor for Node.js packages), and I embarrassingly hung on to Bower until just last year. Most of the community, however, has been using Spago, a PureScript build tool supporting package sets (fixed versions of packages that are known to be compatible). Long story, but some core members have been designing a new registry to house current and historical PureScript packages. Weʼre very close to releasing it!^{2}

In the interest of maintaining a healthy ecosystem, we want the new registry to support not just package sets but also traditional version solving. And thatʼs where I came in. Something about my mathy skills being a perfect fit for getting nerd-sniped by a version solving algorithm. Oh and would you help fix the versioning issues for legacy packages while youʼre at it? Sure, sure I will.

The challenge of version solving in a package ecosystem is coming up with particular versions of packages that satisfy not only the dependencies of the current project, but their own dependencies too. You also want to ensure they are up-to-date by taking the latest possible versions – but sometimes those are not compatible with other declared dependencies. The problem is expected to be difficult and slow to solve in general, but it is possible to optimize for what package dependencies look like in practice, and that is what I have done.

Quick notes on conventions/terminology before we get too far in:

The actual details of how versions are tagged donʼt matter, just that they are totally ordered.^{3} For example, it could just be flat integers for all we care. But usually we take them to be lexicographically-ordered lists of integers, like `5.0.3`, which is less than `5.1.0`.

How we form *ranges* over versions is pretty important, though, and early on the registry decided to only allow half-open intervals. That is, ranges have the form `>=A.B.C <X.Y.Z`, which I will use throughout this article. Again, it isnʼt very sensitive to details here (who cares that it is half-open?), but this does seem to be the right level of generality. Supporting more complex queries is asking for trouble.
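As a concrete sketch of these conventions (my own toy rendition, not registry code), versions can be modeled as tuples ordered lexicographically, and ranges as half-open interval pairs:

```python
# Versions: lexicographically ordered tuples of integers.
# Ranges: half-open intervals (lower, upper), meaning >=lower <upper.

def parse_version(s):
    """'5.0.3' -> (5, 0, 3); tuple comparison is already lexicographic."""
    return tuple(int(p) for p in s.split("."))

def includes(rng, version):
    """A range includes a version exactly when lower <= v < upper."""
    lower, upper = rng
    return lower <= version < upper
```

Note how the upper bound is excluded, per the half-open convention.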

Finally, from a version-solving point of view, a registry contains information of what versions of packages there are, and for each package version a record of what dependencies it requires and the appropriate version ranges for those packages. That is, it can be represented with the following PureScript datatype:

```
-- A list of required dependencies
-- with their version ranges
type Manifest = Map PackageName Range

-- A list of all extant package versions
type RegistryIndex =
  Map PackageName
    (Map Version Manifest)
```

Solving means taking a manifest and finding versions for each package in it, preferring later versions^{4}:

```
solve
  :: RegistryIndex
  -> Manifest
  -> Either SolverErrors
       (Map PackageName Version)
```

Along with some correctness constraints to ensure it is the solution we want.

```
let r :: RegistryIndex
let m :: Manifest
let otherSol :: Map PackageName Version

-- We need the solution to solve the manifest and dependency's requirements
isASolutionFor r m (fromRight (solve r m)) &&
  -- There are no strictly better versions to be found
  ( isASolutionFor r m otherSol
      `implies` isn'tWorseSolutionThan otherSol (fromRight (solve r m))
  )
where
  satisfies
    :: Map PackageName Version
    -> Map PackageName Range
    -> Boolean
  satisfies sol m =
    allWithIndex
      ( \package range ->
          case Map.lookup package sol of
            Nothing -> false
            Just version -> range `includes` version
      )
      m

  isASolutionFor
    :: RegistryIndex
    -> Manifest
    -> Map PackageName Version
    -> Boolean
  isASolutionFor r m sol = and
    -- All packages received a version
    [ Map.keys m `isSubsetEqOf` Map.keys sol
    -- All solved versions fit into the range
    -- as required in the manifest
    , sol `satisfies` m
    -- All packages have their dependencies satisfied
    , allWithIndex
        ( \package version ->
            case Map.lookup package r >>= Map.lookup version of
              Nothing -> false
              Just deps ->
                sol `satisfies` deps
        )
        sol
    ]

  isn'tWorseSolutionThan
    :: Map PackageName Version -> Map PackageName Version -> Boolean
  isn'tWorseSolutionThan other optimal =
    Map.keys optimal `isSubsetEqOf` Map.keys other
      && allWithIndex
          ( \package version ->
              case Map.lookup package other of
                -- A package the other solution omits cannot count against ours
                Nothing -> true
                -- Otherwise our chosen version must be at least as new
                Just otherVersion -> version >= otherVersion
          )
          optimal
```

In particular, note that dependencies are associated with a particular *version*. A package *range* doesnʼt need to have well-defined dependencies at all!

This is something that we forget about when using packages in our day-to-day lives, but an algorithm needs to handle all cases we could throw at it.

As I alluded to in the intro, I started off by copying Elmʼs version solving algorithm. Itʼs a very simple depth-first backtracking algorithm:

- Try the latest compatible version of the package in front of you, based on the global requirements
- Add its dependency ranges to the global requirements^{5}
- Recursively see if the new global requirements can be solved
- Backtrack to the next latest version at each failure.
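The steps above can be sketched over a toy registry (a hypothetical shape of my own, `{package: {version: {dep: (lower, upper)}}}` with integer versions; this is an illustration, not Elmʼs actual code):

```python
def solve(registry, goals, solution=None):
    """Depth-first backtracking: try the latest version in range for the
    package in front of us, add its dependency ranges to the requirements,
    recurse, and backtrack to the next-latest version on failure."""
    solution = dict(solution or {})
    goals = dict(goals)
    if not goals:
        return solution
    package, (lo, hi) = next(iter(goals.items()))
    del goals[package]
    if package in solution:
        # Already committed: just check the chosen version still fits
        if lo <= solution[package] < hi:
            return solve(registry, goals, solution)
        return None
    candidates = [v for v in registry.get(package, {}) if lo <= v < hi]
    for version in sorted(candidates, reverse=True):  # latest first
        attempt = dict(solution)
        attempt[package] = version
        new_goals = dict(goals)
        for dep, (dlo, dhi) in registry[package][version].items():
            if dep in new_goals:  # naively intersect with the existing range
                dlo = max(dlo, new_goals[dep][0])
                dhi = min(dhi, new_goals[dep][1])
            new_goals[dep] = (dlo, dhi)
        result = solve(registry, new_goals, attempt)
        if result is not None:
            return result  # otherwise: backtrack, try the next version
    return None
```

Every failed branch throws away everything it learned, which is exactly the duplicated-work and bad-errors problem discussed next.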

Itʼs easy to see why this is worst-case exponential, and not going to hit fast cases particularly often. In fact, we expect the problem to remain worst-case exponential, but spoiler: we can do much better in most reasonable cases!

Besides performance, the main obstacle I wrestled with was that it had no errors. It turns out these are related concerns: because the algorithm is so naïve, it isnʼt making use of available information to make smart choices, and this would reflect in the errors it could produce.

I discovered that this problem of solving package versions corresponds well to what I have been thinking about in terms of compiler/typechecker errors for the past couple years. So thereʼs some good lore here on what I believe errors should look like, but thatʼs for another post.

Basically, good errors should be a faithful reflection of the internal logic of the solver. This is the main hint that performance and errors are linked: if the solver is trying too many bad options, itʼs going to generate a ton of errors for all of those choices. These errors are bad because they mainly reflect bad choices of the solver, not necessarily problems with the underlying data (the manifests). Itʼs only once *every option* has failed that you know that the underlying manifests were not compatible. Our goal later, then, will be to reduce the number of choices to make and commit to errors as soon as possible.

The second problem with the errors is that the naïve backtracking does a *lot* of duplicate work, in between choices of packages. In the worst case scenario, two package versions have the same manifests, so trying them separately will duplicate most of the work!^{6}

It is possible to deduplicate errors after the fact, but those heuristics seem complex in general, and there are two problems still:

- Youʼve already lost the performance associated with the duplicate work, and are spending more time trying to fix it
- You might as well write the algorithm to incorporate the deduplication in the first place!!

There are some existing approaches to increase sharing/reduce duplicate work, in the context of general constraint solving and more particularly version solving with these type of bounds. I briefly glanced at them, but they donʼt seem to address the heart of the issue like my algorithm does.

In a solver algorithm, we write programs in terms of some error monad. The backtracking algorithm essentially corresponds to a complicated Boolean expression, a tree of various constraints joined with conjunction and disjunction. Thinking of it as `Applicative` + `Alternative`, we see that `<*>` corresponds to conjunction `&&` and `<|>` corresponds to disjunction `||`.

```
console >=5.0.0 <6.0.0
(console == 5.0.0 && prelude >=5.0.0 <6.0.0)
|| (console == 5.0.1 && prelude >=5.0.1 <6.0.0)
```

An error, then, is some kind of proof that the Boolean always evaluates to false. SAT solvers have done a great job of doing this in the general case. And you can think a bit about what this means.

In addition to the literal Boolean clauses, we want the errors to record some additional metadata about where they came from: particular manifests and the current dependency from the manifest we are trying to solve.

However, we can only do so much: we remain limited to the logic of the algorithm. With a depth-first algorithm in particular, the errors donʼt convey the global picture that the user is looking for.

I mean, you *can* report these kinds of Boolean clause errors, but they are so confusing that you might as well just throw up your hands and say “I tried something and it didnʼt work.” Thatʼs all the user would get from the errors anyways, since thatʼs really all the algorithm did: It started with an essentially random package, committed to a version of it immediately, tried other things as a consequence, and eventually reported that nothing worked.

So, since my goal was better errors, my next idea was to try to patch it to *run* the depth-first backtracking algorithm, but create a post-mortem analysis to *report* more sensible errors. For example, from the Boolean algebra perspective, you can do basic tricks to factor out common sub-expressions, which you can combine with what you know about comparing versions to ranges.^{7}

I couldnʼt bring myself to write that. So I just wrote a novel breadth-first algorithm.

I spent a significant chunk of time writing it. I spent several weekends debugging its performance.

And the results are amazing. */me pats self on back*

Hereʼs where I admit my biggest weakness: prior art. I have a great difficulty reading existing research on some topics. Especially when the problem is so obviously begging for a nice solution like this! Itʼs easier to work out the details for myself to be honest. And then blog about it so that people who are *not* like me learn what I have done. (Apologies to those who are like me who will never read this and perhaps reinvent it. Godspeed.)

I spent a couple months designing a whole new algorithm from scratch. The basic idea is that we gather as much information we can before committing to any versions. This is done through the use of what I have coined as quasi-transitive dependencies.

The main steps are:

- Load the slice of the registry index that we care about: package versions that are transitively reachable from the package ranges mentioned in the current manifest.
- Gather information about *quasi-transitive* dependencies for manifests in the registry as well as the current manifest we are solving, looping until there is no more obvious information to discover.
- Check if the requirements have already hit an error.
- If not, check if we have solved it: do all the latest versions of requirements work as a solution?
- Only as a last resort do we succumb to picking a package and recursively solving each of its versions, starting from the latest.

Note that the quasi-transitive dependencies check essentially commits to unique versions immediately, so by the time we reach step 5 we know that there are at least two possible versions of some dependency and are forced to commit to one to make progress. It turns out that in practice, we already hit errors before we have to do that, so weʼve avoided the worst of the exponential blowup!

Recall what I said in Dependencies are tricky: “A package *range* doesnʼt need to have well-defined dependencies at all!” Oh – but they often *do* in practice.

If we can get extra knowledge about requirements before committing to any particular versions, we have a chance at implementing some sort of breadth-first search.

How much extra knowledge we obtain depends on how packages treat their dependency bounds in the registry. In the case of how PureScript packages tend to bound dependencies, it turns out to be a lot of knowledge. This is because most stable PureScript libraries update with each breaking compiler release and depend on the corresponding major version range of `prelude` and various other core packages. Since a lot of versions move in lockstep, it is pretty safe to assign loose dependencies to a package range and even reach for further transitive dependencies.

In general, when bumping minor and patch versions, packages tend to keep the same list of dependencies at similar version ranges. Things are a bit more chaotic between major versions, but it is rarer that packages allowed different major versions in their manifests in the first place, and so there is some semblance of continuity.

Now we need to use this to our advantage:

The idea is that we come up with *quasi-transitive dependencies* for a package range – a lower bound of the absolutely necessary requirements that follow from a package *range* being required.

There are two rules here:

- If a package is not required by all versions in the range, we cannot say it is required overall.
- When it *is* depended on by all versions in a range, we take the loosest bounds we see: the lowest lower bound and the greatest upper bound.
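A minimal sketch of these two rules (my own toy rendition, with integer bounds rather than the registryʼs types):

```python
def common_dependencies(manifests):
    """`manifests`: one dependency dict {dep: (lower, upper)} per version
    in the range. Returns the dependencies that are required no matter
    which version we later commit to, at the loosest bounds seen."""
    if not manifests:
        return {}
    # Rule 1: only keep dependencies required by *all* versions in the range
    shared = set(manifests[0])
    for m in manifests[1:]:
        shared &= set(m)
    # Rule 2: take the lowest lower bound and the greatest upper bound
    return {
        dep: (min(m[dep][0] for m in manifests),
              max(m[dep][1] for m in manifests))
        for dep in shared
    }
```

Intersecting the keys while loosening the bounds is exactly the merge-common-keys behavior that the `Apply`-based semigroup instance below packages up.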

It turns out that we can formulate this rule as a semigroup instance that applies the logic for us to a collection of manifests:

```
instance Semigroup (App (Map PackageName) Loose) where
  append (App m1) (App m2) = append <$> m1 <*> m2

foldMap1
  :: NonEmptyArray (App (Map PackageName) Loose)
  -> App (Map PackageName) Loose

instance Coercible Manifest (App (Map PackageName) Loose)
```

Note that this is in fact not a monoid: `Map` only has an `Apply` instance (which gives the `<*>` operator to merge common keys), not `Applicative` (which would give `pure`, but that does not make sense for `Map` since it would have to contain *all* possible keys!).

As a further optimization, while we are checking package versions, we may discard those that do not solve due to an obvious conflict. This may seem strange: In the PureScript registry, each package will solve individually, we check that on upload. But given the additional constraints of a particular manifest we are solving, we may end up with conflicts against various package versions that are incompatible with the global requirements, especially as we continue to aggregate quasi-transitive dependencies.

```
-- | We record what dependency ranges are required no matter which version
-- | of the package we pick from the registry. That is, we report the loosest
-- | bounds when all packages report a bound for it. By filling in transitive
-- | dependencies on the registry itself, then, these bounds become more
-- | accurate.
-- |
-- | Also note that removing the redundant requirements via `addFrom` is safe
-- | with the assumptions here: if one local requirement is equal to or looser
-- | than a global requirement, then this result here would also be equal to or
-- | looser than the global requirement.
commonDependencies
  :: TransitivizedRegistry
  -> PackageName
  -> Intersection
  -> SemigroupMap PackageName Intersection
commonDependencies registry package range =
  let
    inRange =
      getPackageRange registry package range
    solvableInRange =
      Array.mapMaybe (traverse toLoose) (Array.fromFoldable inRange)
  in
    case NEA.fromArray solvableInRange of
      Nothing -> mempty
      Just versionDependencies ->
        case NEA.foldMap1 App (un SemigroupMap <$> versionDependencies) of
          App reqs ->
            SemigroupMap $ reqs <#> asDependencyOf range <<< fromLoose
```

This quasi-transitive dependency business looks a bit like a familiar formula: the composition of two relations in logic.

Phrased in terms of set theory, Wikipedia says:

If $R \subseteq X \times Y$ and $S \subseteq Y \times Z$ are two binary relations, then their composition $R;S$ . . . is defined by the rule that says $(x,z)\in R;S$ if and only if there is an element $y\in Y$ such that $x\,R\,y\,S\,z$ (that is, $(x,y)\in R$ and $(y,z)\in S$).

The key part here is that we take our input and our output and we ask: is there something *in the middle* that serves to connect the input to the output? (Thinking of relations as boxes that connect certain inputs to certain outputs.)

However, we arenʼt dealing with general relations here, weʼre only dealing with half-open intervals. Weʼre asking: for a version *range*, what *range* is constructed by taking the ranges of *each version* in the middle?

To be a bit more direct with this analogy, a relation $R \subseteq X \times Y$ can equivalently be written as $R \in \mathcal{P}(X \times Y)$. ($\mathcal{P}(Z)$ here is the powerset monad $\mathcal{P}(Z) = Z \to \textrm{Prop}$, which consists of all subsets of the given set $Z$.) And by currying, this can be viewed as $R \in X \to \mathcal{P}(Y)$. This construction $X \to M(Y)$ for a monad $M$ is called the Kleisli category. So now the question is: do intervals also form a monad, by taking loose bounds?

The easy answer is that we can certainly think of it as an approximation on top of the underlying set-relation model. That is, we know how to make intervallic dependencies a relation, so we compose them as relations and then take the smallest interval that contains every interval we came across.
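A toy rendition of that approximation (hypothetical names, integer versions): compose the dependency relation version by version, then take the interval hull of everything encountered in the middle.

```python
def compose_ranges(versions_of_y, x_range_on_y, deps_of_y_on_z):
    """Which range of Z does a range of Y require? Look at each concrete
    version of Y inside the range, collect its interval on Z, and take the
    smallest interval containing all of them (None if some version in the
    range drops the dependency entirely)."""
    lo, hi = x_range_on_y
    intervals = []
    for v in versions_of_y:
        if lo <= v < hi:
            rng = deps_of_y_on_z.get(v)
            if rng is None:
                return None                # not required by every version
            intervals.append(rng)
    if not intervals:
        return None
    # The interval hull: lowest lower bound, greatest upper bound
    return (min(l for l, _ in intervals), max(u for _, u in intervals))
```

This is the set-relation composition followed by the "smallest containing interval" step described above.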

Perhaps there is a way to categorify it directly, I donʼt know. We can come up with an identity, but Iʼm not so sure that associativity would hold.

To see how it fits: `Unit -> Package X range -> Package Y range` (X depends on Y).

Thatʼs only dealing with versions of a single package; the full algorithm bundles all the packages together.

The core backtracking algorithm actually still exists in the spine of the solver, but its role is greatly reduced. In fact, this has a funny implication for testing the algorithm: *the correctness is visible not in finding the right solutions but in the algorithmʼs efficiency and errors.*

The literal results of the solver were accurate all along. But when I finally got it working *fast*, I knew all my logic was in place for all the intermediate steps. In particular, this means that we preempted most of the (exponential) backtracking.

Again, a topic for another blog post, but I love monoids, especially semilattices, because they capture information gathering in ways that lend themselves to reliable implementation.

In particular, because of their idempotence, semilattices are great because you just need to make sure you cover all cases. Thereʼs no such thing as double-counting in a semilattice computation! When youʼre dealing with a well-behaved logical scenario, if you have written your logic correctly (i.e. each derivation is valid) and you cover all the cases (you eventually produce every fact you are allowed to derive), thereʼs no chance that you accidentally make things break.^{8}
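A minimal sketch (types mine) of why idempotence makes this safe: joining the same fact into the accumulated state twice is a no-op, so re-deriving a fact never changes the answer.

```typescript
// A semilattice join on version bounds: tighten by max lower / min upper.
type Bounds = { lower: number; upper: number };

function join(a: Bounds, b: Bounds): Bounds {
  return {
    lower: Math.max(a.lower, b.lower),
    upper: Math.min(a.upper, b.upper),
  };
}

const fact = { lower: 2, upper: 9 };
let state = { lower: 0, upper: 10 };
state = join(state, fact);
const again = join(state, fact); // deriving the same fact a second time
// again equals state: no double-counting
```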

We already saw our first semilattice `Semigroup (App (Map PackageName) Loose)` above. However, I left out the definition of `Loose` and its `Semigroup` instance.

The *data* contained in `Loose` is just a lower bound and an upper bound, and we want the lower bound to be less than the upper bound for it to be valid. We also pack in *metadata* that describes where each bound came from: the `SolverPosition` datatype, which we will discuss below in Provenance.

To achieve this, we first define a type that describes a bound with metadata packed in. Then we add to this operations that take the maximum and minimum of the bounds, and *aggregate* the metadata if they were the same bound. Thatʼs right, **the metadata itself forms a semilattice!**^{9}

```
data Sourced = Sourced Version SolverPosition

newtype MinSourced = MinSourced Sourced
instance Semigroup MinSourced where
  append a@(MinSourced (Sourced av as)) b@(MinSourced (Sourced bv bs)) =
    case compare av bv of
      LT -> a
      GT -> b
      EQ -> MinSourced (Sourced av (as <> bs))

newtype MaxSourced = MaxSourced Sourced
instance Semigroup MaxSourced where
  append a@(MaxSourced (Sourced av as)) b@(MaxSourced (Sourced bv bs)) =
    case compare av bv of
      GT -> a
      LT -> b
      EQ -> MaxSourced (Sourced av (as <> bs))
```

Now we get both `Loose` and `Intersection` for free by the right arrangement of these types. Heck, we even get their coercion for free:

```
newtype Loose = Loose
  { lower :: MinSourced
  , upper :: MaxSourced
  }
derive newtype instance Semigroup Loose

newtype Intersection = Intersection
  { lower :: MaxSourced
  , upper :: MinSourced
  }
derive newtype instance Semigroup Intersection

-- API for `Intersection`
upperBound :: Intersection -> Version
upperBound (Intersection { upper: MinSourced (Sourced v _) }) = v

lowerBound :: Intersection -> Version
lowerBound (Intersection { lower: MaxSourced (Sourced v _) }) = v

good :: Intersection -> Boolean
good i = lowerBound i < upperBound i

satisfies :: Version -> Intersection -> Boolean
satisfies v r = v >= lowerBound r && v < upperBound r

-- `Loose` has to be a valid interval
toLoose :: Intersection -> Maybe Loose
toLoose i | good i = Just (coerce i)
toLoose _ = Nothing

fromLoose :: Loose -> Intersection
fromLoose = coerce
```
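Ignoring the provenance metadata, the same arrangement can be sketched in TypeScript (with versions simplified to numbers, names my own): `Intersection`ʼs append tightens bounds, while `Loose`ʼs append widens them.

```typescript
type Range = { lower: number; upper: number }; // half-open [lower, upper)

// Intersection: lower bounds take the max, upper bounds take the min.
const appendIntersection = (a: Range, b: Range): Range => ({
  lower: Math.max(a.lower, b.lower),
  upper: Math.min(a.upper, b.upper),
});

// Loose: lower bounds take the min, upper bounds take the max.
const appendLoose = (a: Range, b: Range): Range => ({
  lower: Math.min(a.lower, b.lower),
  upper: Math.max(a.upper, b.upper),
});

const good = (i: Range): boolean => i.lower < i.upper;
const satisfies = (v: number, r: Range): boolean =>
  v >= r.lower && v < r.upper;

const i = appendIntersection({ lower: 1, upper: 6 }, { lower: 3, upper: 8 });
const l = appendLoose({ lower: 1, upper: 6 }, { lower: 3, upper: 8 });
// i is the tighter {3,6}; l is the wider {1,8}
```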

Why donʼt we require `Intersection` to be a valid interval? As we will talk about in the next section, `Intersection` is the primary way we keep track of the knowledge we have learned already. Being in the business of aggregating information, we want to know all we can about the situation our solver is confronted with, and we can just accumulate knowledge by throwing it into this semilattice.

We could make taking the intersection of intervals a partially-defined operation (`Intersection -> Intersection -> Either Error Intersection`), but that would mean we have to bail out once a single intersection becomes invalid. Instead, we integrate invalid intervals directly into the semilattice structure, keeping them around and turning them into errors later (this is why we give them the metadata about provenance!). This gives us multiple errors emerging from one step for free, which is incredibly convenient.

Figuring out the correct way to propagate known requirements kept me occupied for days. It turns out I had done it wrong the first time, so it is good I thought it over again!

Our goal is to implement `solveStep` here using `commonDependencies` (see above) and `exploreTransitiveDependencies`:

```
-- Semilattice version of `Registry`
type TransitivizedRegistry =
  SemigroupMap PackageName
    ( SemigroupMap Version
        (SemigroupMap PackageName Intersection)
    )

type RRU =
  { registry :: TransitivizedRegistry
  , required :: SemigroupMap PackageName Intersection
  , updated :: TransitivizedRegistry
  }

-- | Discover one step of quasi-transitive dependencies, for known requirements
-- | and the rest of the registry too.
solveStep :: RRU -> RRU

-- Key piece:
exploreTransitiveDependencies :: RRU -> RRU
```

The `registry :: TransitivizedRegistry` and `required :: SemigroupMap PackageName Intersection` represent the local dependencies for each package version and the global requirements of the initial manifest given to the solver, respectively. They are both purely accumulative: what goes in comes out with some more information. The additional information will simply be added dependencies and tightened bounds on existing dependencies. Provenance metadata may accumulate too (we donʼt really need to care about that; it is just along for the ride).

The other field, `updated :: TransitivizedRegistry`, is a bit different: it does not carry over from step to step, it only talks about what changed at the last step. This is because, as weʼre keeping `registry :: TransitivizedRegistry` updated, we want to only calculate updates to the things that might need it.

When we first call `solveStep`, we treat everything as updated:

```
solveSeed :: RR () -> RRU
solveSeed { registry, required } = { registry, required, updated: registry }
```

and the process stabilizes when there are no updates:

```
-- | Add quasi-transitive dependencies until it stabilizes (no more updates).
-- | Needs to know what was updated since it last ran.
solveSteps :: RRU -> RR ()
solveSteps r0 = go r0
  where
  go r@{ registry, required } | noUpdates r = { registry, required }
  go r = go (solveStep r)
```
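The shape of `solveSeed`/`solveSteps` is just a fixed-point loop; here is a generic sketch (not the solverʼs actual types):

```typescript
// Iterate a step function until it reports no updates.
type Step<S> = (state: S) => { state: S; updated: boolean };

function fixpoint<S>(step: Step<S>, seed: S): S {
  let state = seed;
  for (;;) {
    const r = step(state);
    if (!r.updated) return r.state; // stabilized: no more updates
    state = r.state;
  }
}

// Toy step: tighten an upper bound toward 5, one unit per tick.
const result = fixpoint<number>(
  (n) => (n > 5 ? { state: n - 1, updated: true } : { state: n, updated: false }),
  9
);
// result stabilizes at 5
```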

Keeping track of what was updated is certainly the trickiest part of the whole algorithm to reason about, but there is this one nugget of insight that coalesced into the knowledge I needed to turn it into an algorithm:

> The manifests for package versions might need to update when some of their dependencies update. However, not all updates need to propagate like this from dependencies to their reverse dependencies.

In particular, in the case that a manifest is updating because its dependencies tightened, *if* this could affect its reverse dependencies they should *already* be depending on the transitive dependencies directly and updating because of it. This leaves us with the only major updates being because a dependency was *added*, which the parent did not know about yet so it needs to rescan its dependencies to potentially add the dependency itself.

The other case is that if a package version picks up an obvious failure, its reverse dependencies need to be notified. They may pick up a quasi-transitive dependency once this failing package version is dropped, if it was missing that particular dependency but others had it.

```
-- | A package may update because its dependencies tightened, but any reverse
-- | dependencies should have already caught that update in this same tick.
-- | So what we look for is either a new transitive dependency picked up (which
-- | the parent will need to incorporate), or newly failing to solve,
-- | both of which may introduce new dependencies for reverse dependencies
-- | through the `commonDependencies` calculation.
majorUpdate
  :: SemigroupMap PackageName Intersection
  -> SemigroupMap PackageName Intersection
  -> SemigroupMap PackageName Intersection
  -> Boolean
majorUpdate (SemigroupMap required) (SemigroupMap orig) updated =
  let
    minor = { added: false, failedAlready: false, failedNow: false }

    info :: { added :: Boolean, failedNow :: Boolean, failedAlready :: Boolean }
    info = updated # anyWithIndex \package range ->
      case Map.lookup package orig of
        Nothing ->
          -- This bound may have been omitted merely because it was subsumed by
          -- a global requirement (see `addFrom`), so adding it back does not
          -- count as a major update:
          case Map.lookup package required of
            Nothing -> minor { added = true }
            Just range' -> minor
              { added = lowerBound range > lowerBound range'
                  || upperBound range < upperBound range'
              }
        Just r -> minor { failedAlready = not good r, failedNow = not good range }
  in
    case info of
      { added: true } -> true
      { failedNow: true, failedAlready: false } -> true
      _ -> false

-- | Update package versions in the registry with their quasi-transitive
-- | dependencies, if their dependencies were updated in the last tick. The set
-- | of global requirements is needed here because those are elided from the
-- | dependencies in each package version, so to tell how the local requirements
-- | updated we need to peek at that (see `majorUpdate`).
exploreTransitiveDependencies :: RRU -> RRU
exploreTransitiveDependencies lastTick =
  (\t -> { required: lastTick.required, updated: accumulated (fst t), registry: snd t }) $
    lastTick.registry # traverseWithIndex \package -> traverseWithIndex \version deps ->
      let
        updateOne depName depRange =
          case Map.isEmpty (unwrap (getPackageRange lastTick.updated depName depRange)) of
            true -> mempty
            false -> Tuple (Disj true) (commonDependencies lastTick.registry depName depRange)

        Tuple (Disj peek) newDeps = foldMapWithIndex updateOne deps

        -- keep GC churn down by re-using old deps if nothing changed, maybe?
        dependencies = if peek then deps <> newDeps else deps

        updated = case peek && majorUpdate lastTick.required deps dependencies of
          true -> doubleton package version dependencies
          false -> mempty
      in
        Tuple updated dependencies

-- | Discover one step of quasi-transitive dependencies, for known requirements
-- | and the rest of the registry too.
solveStep :: RRU -> RRU
solveStep initial =
  { required: initial.required <> moreRequired
  , registry: moreRegistry
  , updated: updated <> updatedOfReqs
  }
  where
  -- Transitivize direct requirements
  moreRequired = initial.required # foldMapWithIndex (commonDependencies initial.registry)
  -- Record updates to them
  updatedOfReqs = requirementUpdates initial moreRequired
  -- Transitivize the rest of the registry, which should be:
  --   (1) Pruned at the start to only reachable package versions
  --   (2) Only touching packages that were directly updated last round
  { updated, registry: moreRegistry } =
    exploreTransitiveDependencies
      (initial { registry = map (addFrom moreRequired) <$> initial.registry })
```

It turns out that the algorithm is naturally efficient, with some help.

The biggest trick is *using global constraints to discard redundant local constraints*. That is, if the manifest you are solving already constrains `prelude >=6.0.0 <7.0.0`, then each package that lists that requirement or a looser one can ignore it.
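A sketch of that check with simplified numeric ranges (names mine): a local constraint is redundant exactly when it contains the global one, because then the global requirement already implies everything the local one says.

```typescript
type Range = { lower: number; upper: number }; // half-open [lower, upper)

// The local range is redundant when the global range sits inside it:
// every version the global requirement allows is already allowed locally.
function subsumedByGlobal(local: Range, globalReq: Range): boolean {
  return globalReq.lower >= local.lower && globalReq.upper <= local.upper;
}

const globalReq = { lower: 6, upper: 7 }; // like `prelude >=6.0.0 <7.0.0`
const looserLocal = { lower: 5, upper: 7 }; // looser than global: droppable
const tighterLocal = { lower: 6, upper: 6.5 }; // adds information: keep it
```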

```
-- | The key to efficiency: take information from the bounds of global
-- | requirements and add it to the local requirements of each package version
-- | in the registry, BUT remove redundant bounds as we do so.
-- |
-- | For example, if we have a global requirement `>=3.1.0 <4.0.0`, then in the
-- | registry we will keep local dependency ranges for the same package that
-- | look like `>=3.2.0 <4.0.0` or `>=3.1.0 <3.9.0` and remove ranges like
-- | `>=3.0.0 <4.0.0` or `>=3.1.0 <4.0.0` itself.
addFrom
  :: SemigroupMap PackageName Intersection
  -> SemigroupMap PackageName Intersection
  -> SemigroupMap PackageName Intersection
addFrom (SemigroupMap required) =
  over SemigroupMap $ Map.mapMaybeWithKey \package ->
    case Map.lookup package required of
      Nothing -> Just
      Just i -> \j ->
        if j `wouldUpdate` i then Just (j <> i)
        else Nothing

-- | Used in `addFrom`, `wouldUpdate j i` is an optimized version of
-- | `(i <> j /= i)`.
wouldUpdate :: Intersection -> Intersection -> Boolean
wouldUpdate j i =
  lowerBound j > lowerBound i ||
    upperBound j < upperBound i
```

Unfortunately I had to add a bit of special casing in the propagation to handle this, in particular when checking for major updates, but the exceptional efficiency is more than worth the slight inelegance.

I almost made a profiling analysis library. JavaScript performance testing is useless because everything gets washed away in a sea of lambdas, and I couldnʼt find or make a tool to aggregate the lambda timings into their named parent functions. So I resorted to wrapping particular segments of the code in profiling by hand.

I also needed a histogram viewer.

I also tried lots of micro-optimizations:

- Using a specific order of `<>`, since `Map` appends are implemented as a fold over the second argument, so the second argument should be the smaller one.
- Using a difflist (Cayley) representation when I know Iʼm only appending one key at a time but with mixed associativity.
- Implementing `wouldUpdate` directly instead of using the semigroup operation.
- Optimizing the `Ord Version` instance, since comparison is the most common operation in this whole thing.

Did they make a difference? I donʼt know! They appeared to make an incremental difference as I was testing, but once I did the big optimization above I gave up on measuring them.

Room for improvement. But decent off the bat. And a clear direction for improvement, unlike depth-first algorithms.

Conflicts. (Conflict “clauses.”) The problem with backtracking was the errors it produced: particular clauses could conflict, sure, but then you had to work out why that made the whole boolean expression fail, and what that corresponded to in the version solving model.

In the new model, since we just keep adding requirements at each step to tighten bounds, the basic form of conflict is really simple: a required upper bound got pushed below a required lower bound. Or, we could have restricted to a range that has no registered versions.^{10}

There are two ways we combine these errors within the logic of the solver:

- First we note that we may encounter errors in multiple requirements at the same time, so we keep a (non-empty) map of package names to their conflicts. (Reporting multiple errors at once is very helpful!)
- Second it may be the case that a package has versions in range, but we happen to know that none of them are still solvable: they all have conflicts of their own. (We actually just do a very shallow check of this.)

This gets us this data type for errors:

```
data SolverError
  = Conflicts (Map PackageName Intersection)
  | WhileSolving PackageName (Map Version SolverError)
```

This isnʼt completely faithful to the logic of the solver. You have to trust that the system determined these are required: it wonʼt tell you exactly what decisions led to them being required.

But it does keep around provenance information that tells you enough about where it originated.

Normally I like to keep full provenance to detail exactly the path a piece of data took to get through the logic.^{11} However, it is really slick in this domain: we only need to keep track of the endpoints; users donʼt exactly care about what came in between (just that it is reasonable to assume, because it is in fact correct).

So in this case I keep track of which particular package version manifest(s) gave us the constraint we are talking about (`LocalSolverPosition`), and which constraints in the current manifest caused it to be required. Thereʼs some logic to combine these positions which I will not reproduce here.

```
data LocalSolverPosition
  -- | Dependency asked for in manifest
  = Root
  -- | Committed to a specific version
  | Trial
  -- | Required transitive dependency seen in said packages
  | Solving
      ( NonEmptySet
          { package :: PackageName
          , version :: Version
          }
      )

data SolverPosition = Pos LocalSolverPosition (Set PackageName)
```

It seems that it weakens the logical connection just a bit; I donʼt know if they can be put into formal properties anymore. (E.g. “Deleting the mentioned constraint from the current manifest affects it in *this* way.”)

But I believe it is the information that users want to see; it certainly falls into the category of making it actionable so they can fix things and run it again to make progress. In the case that it is a local error, knowing which clauses of the current manifest led to it is crucial in answering the question, “What do I need to change to fix the error?” And sometimes it is a deeper error of outdated dependencies, so you want to know which package is responsible for that incongruous version requirement.

Itʼs interesting that nothing here required that dependencies are acyclic. I actually made some tiny decisions that ensured that this would work (without causing an infinite loop, for example), but they were minor things.

The problem statement:

Given a pen nib of some shape, what composite shape is produced when that pen is drawn along any particular path?

If the inputs are cubic Bézier curves, is the output as well?

Is the Minkowski sum of two piecewise cubic Bézier hulls a piecewise cubic Bézier hull?

More specifically, is the convolution of two cubic Bézier curves a cubic Bézier curve?

The catch? Itʼs mathematically impossible to model the output using cubic curves, as I determined after a bit of calculus. In fact, it fails already for *quadratic* curves (the simpler companion to cubic curves, which would have simpler, more tractable solutions).

The cubic in “cubic Bézier curve” refers to the fact that they are parametric curves modeled by *cubic polynomials* (one polynomial for the $x$ coordinate and one polynomial for the $y$ coordinate, in terms of a shared time variable $t$). Simply put, the solution for how the curves of the pen and the curves of the path interact means that the solution wonʼt be a polynomial anymore, it would at least be a rational function, i.e. a polynomial divided by another polynomial.

However, that doesnʼt prevent us from getting pretty darn close. Let me show you how it works out.

Cubic polynomials have nothing to do with Cubism. At least, not that I know of.

You want to know how I got here? The story begins with a car trip. My customary activity while riding in the car is to invent new writing systems, drawing on my tablet to try different calligraphic curves to start establishing what shapes I want to represent various sounds. (I tend to develop featural writing systems based on sounds represented by the International Phonetic Alphabet.)

Of course, doing this in the car is hard mode already! The bumps of the car mean I have to use the tabletʼs undo feature for more strokes than not. Plus, not only are there the mistakes of my hand being jostled by the car going over bumps, thereʼs also the mistakes of me just being mediocre at calligraphy, *plus* the fact that I have to teach myself the script as Iʼm inventing it! (I do love the creative interaction of drawing random shapes, seeing what feels good, and refining it both intentionally and through the natural iterations.)

Iʼve done this for many years, since before I could drive. As long as Iʼve done that, Iʼve also wanted to digitize the shapes, maybe to make them into a computer font so I donʼt have to manually write things out when I want to see how full sentences look. (ʼTwould also be a great way to explore ligatures and open type features to really make a natural flowing calligraphic font …)

As I mentioned above, the precisely stated mathematical problem says the curves we are looking for arenʼt the type of curves supported by graphics programs today. But why let the mathematical impossibilities get in the way of actually quite good enough? It took me until now to have the skills/insight/motivation to finally realize my dream, and I want to share the result and the process with you.

But first, always start with the demo! Here you can see the musculoskeletal anatomy of a Minkowski calligraphy stroke:

- Black area – the algorithm as it currently stands.
- Red area – the ideal output, approximated. (Double click on the black to generate a finer and finer approximation.)
- Green lines – the patchwork of simple segments (click to debug, double click to delete).
- Orange lines – the special paths added via the approximate convolution algorithm.

We can take apart the pen nib and pen path into a bunch of segments and compute the composite of each segment with each segment (Cartesian product). Each composite of individual segments produces a section of the result, resulting in a patchwork of sections that form the whole composite shape. Having a lot of overlapping sections is OK, since e.g. Inkscapeʼs builtin Boolean operations will simplify the shapes for us.^{1}

In fact, we will end up subdividing the original segments a bunch to produce accurate results:

- We donʼt want any self-intersecting segments [not implemented yet].
- We also donʼt want any segments that stop (their derivative is zero).
- The pen nib needs to be split up at inflection points, so its slope is monotonic along the segment.
- This is because the slope of the pen path needs to be mapped onto the pen nib, and we want a unique solution.

- The pen nib also needs to be split up so that it doesnʼt loop around [not implemented yet].
- More specifically, each segment needs to be split at the tangent of an endpoint.

- The pen paths need to be split at the tangents of the endpoints of the pen nib segment it is being combined with.
- This ensures that each segment either traces out the obvious thing or has a composite.
- Basically we want that either the tangents are disjoint or the pen nibʼs tangents are contained in the pen pathʼs tangents.

Then the task is to come up with the sections that form the patchwork:

- The original segments form corners (especially if they are disjoint)
- With only this, you essentially get stamps at the endpoints, connected by rakes from the points; see the smooth sailing section.

- Finally the special composite that we will spend a lot of time discussing (if the tangents are a subset)

But thatʼs the end result, letʼs see how I got to this algorithm.

The simplest approach is to paste each path on each segment, something like this:

`(Ps,Qs) => Ps.flatMap(P => Qs.flatMap(Q => [shift(Q, P[0]), shift(P, Q[0])]))`

Mathematically we would say weʼre taking the Cartesian product of the segments.

I was able to do this much in Inkscape directly: copy some segments, align them. You can even make linked clones so it updates altogether.

But there were problems: when the paths crossed over it got noticeably too thin. Even before then, the curves were trending too close, as you can see by double-clicking on it to reveal the red approximation. Basically if the pen nib wasnʼt made of perfectly straight lines, the composite stroke would be missing stuff.

Essentially this process is simulating what you would get by stamping the pen nib in certain points, and then drawing a rake through the path to connect them with curves. (The rake only touching the paper at the segmentation points of the pen nib.)

It appeared that anything more complex would require algorithmic help, and oh was I right … There were more issues lurking with curved segments.

Where tangents go wrong.

I sat down and tried to analyze where this occurred. My first thought was what I said above: itʼs where the crossovers happen, right? Right??

However, I realized that canʼt possibly be right: when the curves fail to cover the actual sweep of the pen, the pen has already separated by the time the curves actually cross over each other, and it continues afterwards. That is, the cross-over is a symptom of the issue but not the part that delimits it.

Looking at it more closely (literally) I realized that the separation occurs precisely when the path of the pen parallels the endpoint tangent of one of the curvy segments of the pen nib.

My first thought was to stamp out the problem: insert more stamps of the pen nib at these problematic tangent points where it wants to detach from the real path. Little did I know this was only the start of unraveling a long thread … it was not enough! For longer curvy segments, it was clear that the extra stamps only masked the problem and did not account for what lay between them.

The main insight, which I have already spoiled for you, is that we need to find some composite of the curves of the pen nib with the curves in the pen path, a composite which is not identical to either curve.

To step back from calligraphy for a moment, consider a simpler example: drawing with a circular sponge, marker, whatever. Make it comically big in your mind.

If you draw a straight line, the composite path is very simple: the two endpoints are capped by a semicircle and are connected by a rectangle.

Now consider a curved path: you can quickly imagine that two semicircles joined by the exact path will not do the job. First of all the endpoints are wrong to connect with the endcaps, second of all the curve would look funny!

If you slow down and look at some point on the curve very closely, what points on the circle are actually doing the work of drawing? What part of the circle is extremally far from the curve at that point? The part that is tangent to the curve!

Thus we will end up offsetting each point on the curve by the radius perpendicular to the curveʼs tangent.
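A quick sketch (mine) of that offsetting for a circular pen of radius $r$: move each point along the unit normal, i.e. the tangent rotated a quarter turn and scaled to length one.

```typescript
type Vec = { x: number; y: number };

// Offset a curve point by r along the left-hand unit normal of its tangent.
function offsetPoint(point: Vec, tangent: Vec, r: number): Vec {
  const len = Math.hypot(tangent.x, tangent.y);
  // Rotate the tangent 90 degrees counterclockwise and normalize it.
  const normal = { x: -tangent.y / len, y: tangent.x / len };
  return { x: point.x + r * normal.x, y: point.y + r * normal.y };
}

// Moving right along the x-axis with a pen of radius 2:
const p = offsetPoint({ x: 3, y: 0 }, { x: 1, y: 0 }, 2);
// p is (3, 2): two units straight up, perpendicular to the motion
```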

In fact, this offset curve is no longer a Bézier curve: it is an analytic curve of degree 10.

Funnily enough, although it is mathematically complicated, all graphics programs support approximating this cubic Bézier + circular pen combo: this is just the stroke width parameter of SVG Bézier curves.

As far as the main topic of this post goes, the underlying mathematical impossibility should not discourage us quite yet: circles cannot be exactly captured by Bézier curves either, so our focus on cubic Bézier pen nibs may still be okay. (Spoiler: it is not.)

This thought experiment shows that we really want to find the tangent point on the pen nib that corresponds with the tangent from the pen path. If we can correlate the two for each point in time, we would get a composite path that fills out the proper area, more than the rake and stamp method.

Now we can find a precise curve to work towards: given two “nice” curves, we add up all the points where their tangents are parallel, to obtain a new curve. (This is called the convolution of the two curves.)

We hope to solve this in the case of cubic curves in particular: given a tangent from one curve (the pen path), find the time when the other curve (the pen nib) has the same tangent, and add those points together.

Letʼs try a simpler thing first and see why it fails: we can keep the pen path as a cubic Bézier, but restrict the pen nib to being quadratic.

Taking the tangent *vector* of each curve decreases degree by one: the cubic Bézier has a quadratic tangent *vector*, and the quadratic Bézier has a linear tangent *vector*. This sounds okay so far, but recall that we want the tangents to have the same *slope* (to be parallel). This makes us take fractions (see below for more details in the cubic case).

So the solution is a rational function (ratio of two polynomials). Bézier curves are polynomials, not rational functions, so the result will not be a Bézier curve.

Dealing with cubic curves, their tangent vector (being the derivative of their position vector) is a quadratic function. We want the two tangent vectors to be parallel, so we end up with a quadratic equation of one in terms of the other. Solving the quadratic equation introduces radicals, so it is no longer even a rational function in the cubic case.
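To spell that out with generic names (my notation: $P$ for the pen path, $Q$ for the pen nib, with independent parameters $s$ and $u$): the tangents are parallel exactly when their 2D cross product vanishes, $P'(s) \times Q'(u) = 0.$ Since $Q'$ is quadratic, expanding $Q'(u) = \vec{a} + \vec{b}\,u + \vec{c}\,u^2$ turns this into a quadratic equation in $u$: $\big(P'(s) \times \vec{a}\big) + \big(P'(s) \times \vec{b}\big)\,u + \big(P'(s) \times \vec{c}\big)\,u^{2} = 0,$ whose coefficients are themselves quadratic in $s$. The quadratic formula then puts a square root of a polynomial in $s$ inside $u(s)$, so the convolution point $P(s) + Q(u(s))$ is not a polynomial in $s$, nor even a rational function.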

So we set about approximating the exact curve by a Bézier one.

The first question to ask ourselves is: what are the tangents of the curve at the endpoints? This is a simple question, actually: since we are picking points from the curves where the tangents match, the tangent is simply what it was for the base curve. (We will work through the math below to prove why this is the case.)

If we were approximating by a quadratic curve, this would be all we need to know: the last control point would be at the intersection of the tangents from the endpoints.

But since it is cubic, we have two more points to pick, which should correspond abstractly to another parameter to control at each endpoint, in addition to the tangent there.

Youʼd think this parameter would be curvature. Youʼd **think**!!

The obvious first parameter to control is the tangent angles. If we were approximating with quadratic curves, this would be all there is to it: two points and two tangents.

Working with cubic curves, however, we expect an extra degree of freedom: more control over how it matches our curve. The obvious place to look is curvature.

Curvature (often denoted $\kappa$) is a geometric quantity that captures second-order behavior of a curve, regardless of parametrization. That is, regardless of how $t$ maps to actual $\langle x, y \rangle$ points on the curve, if the curve looks the same, it will have the same curvature.

However, it is not so simple to map curvature onto the Bézier curve parameters, as weʼll see next.

For one thing, the formula involves a complex mix of components: $\kappa = \frac{x'y'' - x''y'}{(x'^2 + y'^2)^{3/2}}.$

And then add in the complexities of the Bézier parameterization and you have a fun problem that yields non-unique solutions.

The proliferation of solutions is kind of problematic since it means we need to guess what is the right solution. At least we know we are looking for clean-looking solutions that do not deviate too much.

Itʼs funny: in order to display curvature in a geometrically meaningful way, you want it to be in units of distance, which means youʼd take its inverse $1/\kappa$. This inverse is the radius of the *osculating circle* that just barely touches the curve in the most graceful way. (Perhaps you know that you can form a circle passing through any three points, possibly a circle with infinite radius if the points are along the same line. This is why it is a second-order property.)

However, despite being in the right units, the radius of the osculating circle is poorly behaved because it can blow up to infinity when the curvature is near zero! (E.g. near inflection points.)

So people often resort to displaying curvature as a kind of vector field associated with the curve, with some implicit conversion of units from inverse distance to real distance.

There is a third-order analogue of curvature called aberrancy. It is related to the *osculating parabola*, since parabolas rotated in space can be fit to four points.

In which we work through the math to compute a cubic Bézier approximation to cubic Bézier convolution, based on matching the known curvature at the endpoints of the exact convolution.

If you want to dig into the details youʼll want some familiarity with vectors, calculus, and parametric curves.

Bézier curves are parametric curves based on some control points. Weʼll only be dealing with 2D cubic Bézier curves. Weʼll put the control points in boldface like $\mathbf{P}$, $\mathbf{Q}$, and give the 2D vectors arrows over top like $\vec{u}$, $\vec{v}$.

$\mathbf{P} = [\langle \mathbf{P}_{0x}, \mathbf{P}_{0y}\rangle, \langle \mathbf{P}_{1x}, \mathbf{P}_{1y}\rangle, \langle \mathbf{P}_{2x}, \mathbf{P}_{2y}\rangle, \langle \mathbf{P}_{3x}, \mathbf{P}_{3y}\rangle] = [\mathbf{P}_{0}, \mathbf{P}_{1}, \mathbf{P}_{2}, \mathbf{P}_{3}],$ with coordinate slices written like $\mathbf{P}_{x} = [\mathbf{P}_{0x}, \mathbf{P}_{1x}, \mathbf{P}_{2x}, \mathbf{P}_{3x}].$

We need the formula for the Bézier polynomial that results from the control points, and weʼll also need its first and second derivatives:

$\begin{aligned} \displaystyle \mathbf{B}_\mathbf{P}(t) &= (1-t)^3 \mathbf{P}_0 + 3t(1-t)^2 \mathbf{P}_1 + 3t^2(1-t) \mathbf{P}_2 + t^3 \mathbf{P}_3\\ &= \mathbf{P}_0 + 3(\mathbf{P}_1 - \mathbf{P}_0)t + 3(\mathbf{P}_2 - 2\mathbf{P}_1 + \mathbf{P}_0)t^2 + (\mathbf{P}_3 - 3\mathbf{P}_2 + 3\mathbf{P}_1 - \mathbf{P}_0)t^3.\\ \mathbf{B}'_\mathbf{P}(t) &= 3(1-t)^{2}(\mathbf{P} _{1}-\mathbf{P} _{0})+6(1-t)t(\mathbf{P} _{2}-\mathbf{P} _{1})+3t^{2}(\mathbf{P} _{3}-\mathbf{P} _{2})\\ &= 3(\mathbf{P}_1 - \mathbf{P}_0) + 6(\mathbf{P}_2 - 2\mathbf{P}_1 + \mathbf{P}_0)t + 3(\mathbf{P}_3 - 3\mathbf{P}_2 + 3\mathbf{P}_1 - \mathbf{P}_0)t^2.\\ \mathbf{B}''_\mathbf{P}(t) &= 6(1-t)(\mathbf{P} _{2}-2\mathbf{P} _{1}+\mathbf{P} _{0})+6t(\mathbf{P} _{3}-2\mathbf{P} _{2}+\mathbf{P} _{1})\\ &= 6(\mathbf{P}_2 - 2\mathbf{P}_1 + \mathbf{P}_0) + 6(\mathbf{P}_3 - 3\mathbf{P}_2 + 3\mathbf{P}_1 - \mathbf{P}_0)t.\\ \end{aligned}$

Naturally the first derivative of a cubic Bézier is quadratic, and the second derivative is linear.
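As code, these formulas transcribe directly (a sketch; the function names and the convention of points as `[x, y]` arrays are mine, not necessarily whatʼs in the repository):

```javascript
// Evaluate a cubic Bézier curve and its first two derivatives at time t.
// P is an array of four control points, each an [x, y] pair.
function bezier(P, t) {
  const s = 1 - t;
  return [0, 1].map(i =>
    s*s*s*P[0][i] + 3*t*s*s*P[1][i] + 3*t*t*s*P[2][i] + t*t*t*P[3][i]);
}
function bezierD1(P, t) {
  const s = 1 - t;
  return [0, 1].map(i =>
    3*s*s*(P[1][i] - P[0][i]) + 6*s*t*(P[2][i] - P[1][i]) + 3*t*t*(P[3][i] - P[2][i]));
}
function bezierD2(P, t) {
  return [0, 1].map(i =>
    6*(1 - t)*(P[2][i] - 2*P[1][i] + P[0][i]) + 6*t*(P[3][i] - 2*P[2][i] + P[1][i]));
}
```

Note how `bezier(P, 0)` returns $\mathbf{P}_0$ and `bezier(P, 1)` returns $\mathbf{P}_3$: the curve interpolates its first and last control points.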

Normally we think of the cross product in 3D space, because the cross product of two 3D vectors is another 3D vector. But it also works in 2D space; it just produces a scalar (1D vector) instead! And it turns out to be a useful abstraction for a lot of our calculations.

$\vec{u} \times \vec{v} = u_x v_y - u_y v_x.$
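In code (with 2D vectors as `[x, y]` arrays, my convention throughout these sketches) itʼs a one-liner:

```javascript
// 2D cross product: a scalar measuring how perpendicular u and v are.
// It is zero exactly when u and v are parallel.
function cross(u, v) {
  return u[0] * v[1] - u[1] * v[0];
}
```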

The two curves are controlled by their own $t$ parameters that are independent of each otherʼs! We need to match them up somehow, and as discussed above, thereʼs a particular way to do that for our application: We need to find the times when they have *parallel tangent lines*, since that will tell us what is the furthest point of the pen nib ($\mathbf{Q}$) (locally) along any given point of the pen path ($\mathbf{P}$).

The slope of the tangent line at time $t$ is given by the ratio of the $y$ and $x$ components of the first derivative of the curve. $\frac{\mathbf{B}'_\mathbf{P}(t)_y}{\mathbf{B}'_\mathbf{P}(t)_x}.$

Weʼll start using $p = t_\mathbf{P}$ to refer to the time along curve $\mathbf{P}$ and $q = t_\mathbf{Q}$ to refer to the time along curve $\mathbf{Q}$. Weʼll think of it as solving for $q$ in terms of $p$; $p$ is the input and $q$ the output. Our goal is to match them up, so the curves have the same slope at corresponding times!

$\frac{\mathbf{B}'_\mathbf{Q}(q)_y}{\mathbf{B}'_\mathbf{Q}(q)_x} = \frac{\mathbf{B}'_\mathbf{P}(p)_y}{\mathbf{B}'_\mathbf{P}(p)_x}.$

**Cross** multiply ${\mathbf{B}'_\mathbf{Q}(q)_y} {\mathbf{B}'_\mathbf{P}(p)_x} = {\mathbf{B}'_\mathbf{Q}(q)_x} {\mathbf{B}'_\mathbf{P}(p)_y}.$

Or use the **cross** product: $\mathbf{B}'_\mathbf{Q}(q) \times \mathbf{B}'_\mathbf{P}(p) = 0.$

(Recall that the cross product is a measure of how *perpendicular* two vectors are, so they are *parallel* exactly when their cross product is zero. This is true in 2D just like in 3D, itʼs just that the cross product is now a scalar quantity, not a vector quantity.)

What does this get us? Well, we can think of it either way, but letʼs assume that weʼre given $p$, so we plug it in and obtain an equation to solve for $q$. Since $\mathbf{B}'_ \mathbf{Q}$ is quadratic, we get a quadratic equation to solve, with some nasty scalar coefficients $a$, $b$, and $c$ coming from the control points of our curves $\mathbf{P}$ and $\mathbf{Q}$, evaluated at $p$:^{2} $a(p)q^2 + b(p)q + c(p) = 0.$

Obviously it gets tedious to write all of that, so we omit the $p$ parameter and simply write: $aq^2 + bq + c = 0.$

There are a few issues we run into.

The first is that the solution doesnʼt necessarily lie on the actual Bézier segment drawn out by $0 \le q \le 1$.

Second, there might be two solutions, since weʼre solving a quadratic!

The solution to both is to split things up! We need to split up the *pen path* so it indexes the tangents at the end of the Bézier segments of the pen nib, after first splitting the *pen nib* at its inflection points.

Splitting at inflection points ensures that the tangent slope is always increasing or decreasing along the segment, so that there is only a single solution. Actually, this also requires knowing that the Bézier segment doesnʼt rotate 180°, so we need to split it if it reaches its original tangents again.

Solving these issues means we can think of the equation above as giving us a function for $q$ in terms of $p$: $q(p) = \frac{-b(p) \pm \sqrt{b(p)^2 - 4a(p)c(p)}}{2a(p)}.$
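As a sketch in code (assuming the splitting just described has already made the in-range root unique; the tolerance for the linear fallback is an arbitrary choice of mine):

```javascript
// Solve a q^2 + b q + c = 0 for the root q in [0, 1], falling back to the
// linear solution when the quadratic coefficient (effectively) vanishes.
function solveQ(a, b, c) {
  if (Math.abs(a) < 1e-12) return -c / b; // linear case
  const disc = b*b - 4*a*c;
  if (disc < 0) return NaN; // no real solution: no parallel tangent here
  const r1 = (-b + Math.sqrt(disc)) / (2*a);
  const r2 = (-b - Math.sqrt(disc)) / (2*a);
  // After splitting, at most one root should land on the segment [0, 1].
  return (0 <= r1 && r1 <= 1) ? r1 : r2;
}
```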

This puts the functions in lock-step in terms of their tangents, giving us what we need to calculate the outside of their sweep.

Weʼll need the derivative of this equation soon, so letʼs calculate it while weʼre here.

My first thought was: great, we have a quadratic equation, so we know the formula and can just take the derivative of it!

This was … naïve, oh so naïve. Letʼs see why.

We have our solution here:

$q = q(p) = \frac{-b\pm\sqrt{b^2 - 4ac}}{2a}.$

So we can take its derivative $q' = q'(p)$, using the chain rule, quotient rule, product rule … oh Iʼll spare you the gory details.

$\begin{aligned} q' &= \frac{(-b'\pm\frac{bb' - 2a' c - 2ac'}{\sqrt{b^2 - 4ac}})2a - 2a'(-b\pm\sqrt{b^2 - 4ac})}{4a^2}\\ &= \frac{-b'\pm\frac{bb' - 2a' c - 2ac'}{\sqrt{b^2 - 4ac}}}{2a} - \frac{a'(-b\pm\sqrt{b^2 - 4ac})}{2a^2} \end{aligned}$

(Recall that $a$, $b$, and $c$ are functions of $p$, so they have derivatives $a'$, $b'$, and $c'$ in terms of that variable.)

Notice any problems?

Well, first off, itʼs an ugly, messy formula! And thatʼs even with hiding the definitions of the coefficients $a$, $b$, and $c$.

The biggest problem, though, is that everything is divided by $2a$ or $4a^2$, which means it doesnʼt work when $a = 0$. That shouldnʼt be too surprising, given that the quadratic formula also fails in that case. (Itʼs the *quadratic* formula after all, not the *quadratic-or-linear* formula!)

I mean, we could solve the linear case separately:

$bq + c = 0$ $q = \frac{-c}{b}$ $q' = \frac{-c' b + cb'}{b^2} = \frac{-c'}{b} + \frac{cb'}{b^2}$

But that also doesnʼt work; it omits the contribution of $a'$, which does in fact influence the result of the rate of change of the quadratic formula, even when $a = 0$.^{3}

So I took a deep breath, started texting my math advisor, and I rubber ducked myself into a much *much* better solution.

You see, the quadratic formula is a lie. How did we define $q$? Certainly not as a complicated quadratic formula solution. It is **really** defined as the implicit solution to an equation (an equation which happens to be quadratic): $aq^2 + bq + c = 0.$

Look, we can just take the derivative of that whole equation, even before we attempt to solve it (only takes the product rule this time!):

$a'q^2 + 2aqq' + b'q + bq' + c' = 0.$

And **this**, now *this* has a nicer solution: $q' = -\frac{a' q^2 + b' q + c'}{2aq + b}.$

I think itʼs cute how the numerator is another quadratic polynomial with the derivatives of the coefficients of the original polynomial. Itʼs also convenient how we have no square roots or plus-or-minus signs anymore – instead we write the derivative in terms of the original solution $q$.

We still have a denominator that can be zero, but this is for deeper reasons: $2aq + b = -b \pm \sqrt{b^2 - 4ac} + b = \pm \sqrt{b^2 - 4ac}.$

Obviously this is zero exactly when $b^2 - 4ac = 0$. This quantity is called the discriminant of the quadratic, and controls much of its behavior: the basic property of how many real-valued solutions it has, as well as deeper number-theoretic properties studied in Galois theory.

I was looking at this and seeing $q^2$ made me think that it could be rewritten a bit, since we can solve for $q^2$ in the defining equation:

$q^2 = \frac{-c-bq}{a}.$

With some work that gives us this formula: $q' = -\frac{a' q^2 + b' q + c'}{2aq + b} = \frac{(a'b - ab')q + a'c - ac'}{a(2aq + b)},$ which is nice and symmetric (it is patterned a little like the cross product in the numerator) but not what I ended up going for, I think I was worried about floating-point precision but idk.
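Numerically, the implicit-derivative formula is tiny (a sketch; `da`, `db`, `dc` stand for the coefficient derivatives $a'$, $b'$, $c'$, and `q` is the already-computed root):

```javascript
// q' from implicit differentiation of a q^2 + b q + c = 0:
// a' q^2 + 2 a q q' + b' q + b q' + c' = 0, solved for q'.
// Note the minus sign from moving the non-q' terms across the equation.
function qPrime(a, b, da, db, dc, q) {
  return -(da*q*q + db*q + dc) / (2*a*q + b);
}
```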

For fun we can see what the second derivative is, though we wonʼt end up using it! (I was scared we would need it at one point but that was caused by me misreading my code.)

$(2aq+b)q' = -(a' q^2 + b' q + c')$ $(2aq+b)q'' + (2a' q + 2aq' + b')q' = -(a'' q^2 + 2a' qq' + b'' q + b' q' + c'')$ $q'' = -\frac{2aq'^2 + a'' q^2 + 4a' qq' + b'' q + 2b' q' + c''}{2aq+b}$

In which I attempt to write out what $a$, $b$, and $c$ actually are. Wish me luck.

We take the standard quadratic form of the tangent of $\mathbf{Q}$:

$\mathbf{B}'_\mathbf{Q}(q) = 3(\mathbf{Q}_1 - \mathbf{Q}_0) + 6(\mathbf{Q}_2 - 2\mathbf{Q}_1 + \mathbf{Q}_0)q + 3(\mathbf{Q}_3 - 3\mathbf{Q}_2 + 3\mathbf{Q}_1 - \mathbf{Q}_0)q^2$

(Notice how there is a constant term, a term multiplied by $q$, and a term with $q^2$.)

We want to wrangle its cross product with $\mathbf{B}'_\mathbf{P}(p)$ into a quadratic equation in $q$: $\mathbf{B}'_\mathbf{Q}(q) \times \mathbf{B}'_\mathbf{P}(p) = a(p)q^2 + b(p)q + c(p)$

So by distributivity, each coefficient we saw above is cross-producted with $\mathbf{B}'_\mathbf{P}(p)$ to obtain our mystery coefficients: $\begin{aligned} a(p) &= 3(\mathbf{Q}_3 - 3\mathbf{Q}_2 + 3\mathbf{Q}_1 - \mathbf{Q}_0) \times \mathbf{B}'_\mathbf{P}(p),\\ b(p) &= 6(\mathbf{Q}_2 - 2\mathbf{Q}_1 + \mathbf{Q}_0) \times \mathbf{B}'_\mathbf{P}(p),\\ c(p) &= 3(\mathbf{Q}_1 - \mathbf{Q}_0) \times \mathbf{B}'_\mathbf{P}(p).\\ \end{aligned}$
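Here is how those coefficients look as code (a sketch with names of my own choosing; the helpers are redefined so the block stands alone):

```javascript
const cross = (u, v) => u[0]*v[1] - u[1]*v[0];

// First derivative of a cubic Bézier (four [x, y] control points) at time t.
function tangent(P, t) {
  const s = 1 - t;
  return [0, 1].map(i =>
    3*s*s*(P[1][i] - P[0][i]) + 6*s*t*(P[2][i] - P[1][i]) + 3*t*t*(P[3][i] - P[2][i]));
}

// Coefficients [a(p), b(p), c(p)] of the quadratic in q expressing
// B'_Q(q) × B'_P(p) = 0, i.e. parallel tangents.
function tangentMatchCoeffs(Q, P, p) {
  const dP = tangent(P, p);
  const quad = [0, 1].map(i => 3*(Q[3][i] - 3*Q[2][i] + 3*Q[1][i] - Q[0][i]));
  const lin  = [0, 1].map(i => 6*(Q[2][i] - 2*Q[1][i] + Q[0][i]));
  const cst  = [0, 1].map(i => 3*(Q[1][i] - Q[0][i]));
  return [cross(quad, dP), cross(lin, dP), cross(cst, dP)];
}
```

A nice sanity check: taking $\mathbf{Q} = \mathbf{P}$, the value $q = p$ must be a root, since a curveʼs tangent is parallel to itself.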

You *could* expand it out more into the individual components, but it would be painful, not very insightful, and waste ink.

Note that this vector cross-product *cannot* be cancelled out as a common term, because the $\mathbf{Q}$-vector coefficients control how the separate components of $\mathbf{B}'_\mathbf{P}(p)$ are mixed together to create the coefficients $a$, $b$, and $c$. However, it could be divided by its norm without a problem (that is, only the direction of $\mathbf{B}'_\mathbf{P}(p)$ matters, not its magnitude – and this is by design.)

Now that we have a formula for $q$ in terms of $p$, we can just plug it in and get our whole curve.

$\mathbf{C}_{\mathbf{P},\mathbf{Q}}(p) = \mathbf{B}_\mathbf{P}(p) + \mathbf{B}_\mathbf{Q}(q(p)).$

Now for the main question: is this a Bézier curve? Nope!

Even if $\mathbf{Q}$ were a quadratic Bézier curve, the solution $q(p)$ would still be a rational function, which is not compatible with the polynomial structure of Bézier curves.

That means we canʼt just stick the curve into an SVG file or similar graphics format; its true form is not natively supported by any graphics libraries. (And for good reason, because itʼs kind of a beast!)

However, we know a lot of information about the curve, and we can use it to reconstruct a decent approximation of its behavior, meaning all is not lost.

We now know an exact formula for the idealized $\mathbf{C}_{\mathbf{P}, \mathbf{Q}}$. We can use this to get some key bits of information that will allow us to construct a good approximation to its behavior.

In particular, we want to know the slope at the endpoints and also the curvature at the endpoints. The curvature is the complicated part.

Itʼs going to get verbose very quickly, so letʼs trim down the notation a bit by leaving $p$ implicit, focusing on $q = q(p)$, and removing the extraneous parts of the Bézier notation:

$\mathbf{C} = \mathbf{P} + \mathbf{Q}(q).$

By construction, the slope at the endpoints is just the slope of $\mathbf{P}$ at the endpoints: $\mathbf{C}' = \mathbf{P}' + \mathbf{Q}'(q)q' \parallel \mathbf{P}',$ since those vectors are parallel by the construction of $q(p)$: $\mathbf{P}' \parallel \mathbf{Q}'(q).$

The curvature is a bit complicated, but we can work through it – it just requires applying the formulas, starting with this formula for the curvature of a parametric curve:

$\mathbf{C}^\kappa = \frac{\mathbf{C}' \times \mathbf{C}''}{\|\mathbf{C}'\|^3}.$
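As code, given sampled derivative vectors (a sketch of the formula above, nothing more):

```javascript
// Signed curvature of a parametric curve from its first and second
// derivative vectors at a point: (C' × C'') / |C'|^3.
function curvature(d1, d2) {
  const cr = d1[0]*d2[1] - d1[1]*d2[0];  // 2D cross product (a scalar)
  const n = Math.hypot(d1[0], d1[1]);
  return cr / (n*n*n);
}
```

For a circle of radius $r$ traced counterclockwise, this gives $1/r$, as it should.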

We already computed $\mathbf{C}'$ above, so we just need to compute $\mathbf{C}''$ and compute the cross product. $\mathbf{C}'' = \mathbf{P}'' + \mathbf{Q}''(q)q'^2 + \mathbf{Q}'(q)q''.$

However, I promised that we wouldnʼt need the second derivative $q''$, so letʼs see how it cancels out in the cross product. With some distributivity we can expand it:

$\begin{aligned} \mathbf{C}' \times \mathbf{C}'' && = &\ (\mathbf{P}' + \mathbf{Q}'(q)q')\\ && \times &\ (\mathbf{P}'' + \mathbf{Q}''(q)q'^2 + \mathbf{Q}'(q)q'')\\\\ && = &\ (\mathbf{P}' + \mathbf{Q}'(q)q') \times \mathbf{P}''\\ && + &\ (\mathbf{P}' + \mathbf{Q}'(q)q') \times \mathbf{Q}''(q)q'^2\\ && + &\ \cancel{(\mathbf{P}' \times \mathbf{Q}'(q))}q''\\ && + &\ \cancel{(\mathbf{Q}'(q) \times \mathbf{Q}'(q))}q'q''\\\\ && = &\ \mathbf{C}' \times (\mathbf{P}'' + \mathbf{Q}''(q)q'^2).\\ \end{aligned}$

Obviously $\mathbf{Q}'(q) \times \mathbf{Q}'(q) = 0$, since those are parallel, being the same vector. But $\mathbf{P}' \times \mathbf{Q}'(q) = 0$ as well, since those are parallel by construction of $q = q(p)$. That means we do not need to deal with the second derivative $q''$.

Now we can get down to business. How do we find the best Bézier curve to fill in for the much more complicated curve $\mathbf{C}_{\mathbf{P},\mathbf{Q}}$ that combines the two curves $\mathbf{P}$ and $\mathbf{Q}$?

Weʼll take six (6) pieces of data from $\mathbf{C}_{\mathbf{P},\mathbf{Q}}$:

- The endpoints: $f_0$, $f_1$ (x2),
- The tangents at the endpoints: $d_0$, $d_1$ (x2), and
- The curvature at the endpoints: $\kappa_0$, $\kappa_1$ (x2).

This should be enough to pin down a Bézier curve, and indeed there is a way to find cubic Bézier curves that match these parameters.

We will be following the paper High accuracy geometric Hermite interpolation by Carl de Boor, Klaus Höllig, and Malcolm Sabin to answer this question. The basic sketch of the math is pretty straightforward, but the authors have done the work to come up with the right parameterizations to make it easy to compute and reason about.

The bad news is that to solve the system of two quadratic equations we end up with a quartic (degree 4) equation. So we see that there can be up to 4 solutions. But we can narrow them down a bunch: solutions with loops (self-intersections) can be ruled out, as can other outlandish solutions with far-flung control points.

For example, one can take these same datapoints from a real cubic Bézier curve and reconstruct its control points from those six pieces of information. In our case, we are hoping that the curves we come across, although not technically being of that form, are very close and will still produce a similar curve to the perfect idealized solution.^{4}

In fact, one cool thing about this implementation is that we can use it to find the closest Bézier curve *without a loop* to one *with a loop*. (And the reverse, though I have not implemented that.)

Basically a lot of shuffling variables around.

We will be solving for $\delta_0$ and $\delta_1$, which scale the control handles along the predefined tangents, giving these Bézier control points: $\begin{aligned} b_0 = f_0,\ &\ b_1 = b_0 + \delta_0 d_0,\\ b_3 = f_1,\ &\ b_2 = b_3 - \delta_1 d_1.\\ \end{aligned}$

Now we compute the curvature at the endpoints for this Bézier curve: $\begin{aligned} \kappa_0 &= 2d_0 \times (b_2 - b_1)/(3\delta_0^2),\\ \kappa_1 &= 2d_1 \times (b_1 - b_2)/(3\delta_1^2).\\ \end{aligned}$

And with these substitutions, $\begin{aligned} f_1 - f_0 &=: a,\\ b_2 - b_1 &= a - \delta_0 d_0 - \delta_1 d_1,\\ \end{aligned}$ we get a system of two quadratic equations for $\delta_0$ and $\delta_1$: $\begin{aligned} (d_0 \times d_1)\delta_0 = (a \times d_1) - (3/2)\kappa_1 \delta_1^2,\\ (d_0 \times d_1)\delta_1 = (d_0 \times a) - (3/2)\kappa_0 \delta_0^2.\\ \end{aligned}$

Itʼs easy to deal with the case when $d_0 \times d_1 = 0$ (that is, when the starting and ending tangents are parallel). For the nonzero case, we reparameterize again according to:

$\begin{aligned} \delta_0 &=: \rho_0 \frac{a \times d_1}{d_0 \times d_1},\\ \delta_1 &=: \rho_1 \frac{d_0 \times a}{d_0 \times d_1}.\\ \end{aligned}$ $\begin{aligned} R_0 &:= \frac{3}{2}\frac{\kappa_0 (a \times d_1)^2}{(d_0 \times a)(d_0 \times d_1)^2},\\ R_1 &:= \frac{3}{2}\frac{\kappa_1 (d_0 \times a)^2}{(a \times d_1)(d_0 \times d_1)^2}.\\ \end{aligned}$

Thus we end up with the very pretty system of quadratics: $\begin{aligned} \rho_0 = 1 - R_1 \rho_1^2,\\ \rho_1 = 1 - R_0 \rho_0^2.\\ \end{aligned}$

We can solve for one of these variables, by substituting from the other equation, $\begin{aligned} \rho_0 &= 1 - R_1 (1 - R_0 \rho_0^2)^2\\ &= 1 - R_1 + 2R_0R_1\rho_0^2 - R_0^2 R_1 \rho_0^4.\\ \end{aligned}$

This is a depressed quartic equation, with coefficients $[1 - R_1,\ -1,\ 2R_0R_1,\ 0,\ -R_0^2R_1]$.
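A crude way to find the real roots of that quartic is a sign-change scan plus bisection (a sketch of my own, over an arbitrary bracket; a production implementation would want a proper polynomial root finder):

```javascript
// Real roots of  (1 - R1) - rho + 2 R0 R1 rho^2 - R0^2 R1 rho^4 = 0
// via a sign-change scan and bisection over [lo, hi].
function rho0Candidates(R0, R1, lo = -4, hi = 4, steps = 4000) {
  const f = r => (1 - R1) - r + 2*R0*R1*r*r - R0*R0*R1*r**4;
  const roots = [];
  let x0 = lo, f0 = f(lo);
  for (let i = 1; i <= steps; i++) {
    const x1 = lo + (hi - lo)*i/steps, f1 = f(x1);
    if (f0 === 0) roots.push(x0); // landed exactly on a root
    else if ((f0 < 0) !== (f1 < 0)) {
      // sign change: bisect [x0, x1] down to a root
      let a = x0, b = x1, fa = f0;
      for (let k = 0; k < 80; k++) {
        const m = (a + b)/2, fm = f(m);
        if ((fa < 0) !== (fm < 0)) b = m; else { a = m; fa = fm; }
      }
      roots.push((a + b)/2);
    }
    x0 = x1; f0 = f1;
  }
  return roots;
}
```

The candidates then get filtered by the sign, size, and curvature checks described below.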

We want solutions with $\delta_0, \delta_1 \ge 0$ – that is, with the control handles pointing the correct way. We also want to generally minimize those variables, otherwise there are outlandish solutions with huge coefficients (particularly ones with loops). Finally, I also added a check that ensures we are getting the correct curvature out of them – for some reason I was getting solutions with flipped curvature.

Itʼs actually really pretty to see solutions with all signs of curvature together.

I have an implementation in vanilla JavaScript of the algorithm described in this post.

Of course it needs some basic theory of vectors and polynomials and Bézier curves. For example, `bsplit(points, t0)` returns a vector of two new Bézier curves that cover intervals $[0, t_0]$ and $[t_0, 1]$ of the input curve, respectively.
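A standard way to implement such a split is de Casteljau subdivision; hereʼs a sketch (not necessarily the implementation in the repository):

```javascript
// Split a cubic Bézier (four [x, y] control points) at t0 into two cubic
// segments covering [0, t0] and [t0, 1], via de Casteljau's algorithm.
function bsplit(points, t0) {
  const lerp = (u, v) => [u[0] + t0*(v[0] - u[0]), u[1] + t0*(v[1] - u[1])];
  const [p0, p1, p2, p3] = points;
  const p01 = lerp(p0, p1), p12 = lerp(p1, p2), p23 = lerp(p2, p3);
  const p012 = lerp(p01, p12), p123 = lerp(p12, p23);
  const mid = lerp(p012, p123); // the point B(t0) on the curve
  return [[p0, p01, p012, mid], [mid, p123, p23, p3]];
}
```

The shared control point `mid` is exactly the point on the curve at $t_0$, which makes for an easy sanity check.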

The important functions in `calligraphy.js` are as follows:

- `compositeI(P,Q)` computes the approximate Bézier convolution of `P` with `Q`.
- `PQ_CURVATURE(P,Q)(p,q=T_SOL(P,Q)(p))` computes the curvature of the exact convolution between `P` and `Q` at `(p,q)` (where `q` should be the point on `Q` corresponding to `p` on `P`, i.e. parallel).
- `INFLXNS(P)` computes the inflection points of `P`.

And the full algorithm is put together (with visualization) in `minkowski.js`:

- `doTheThing(p1, p2)` does the thing, as it says.
- `splitPathAtInflections(p2)` removes pathologies from the pen nib.
- `splitBezierAtTangents(c1, getTangentsL(c2))` splits the pen path according to the points of interest of the pen nib.
- `getTangentsL(c2)` includes both the control point tangents of the Bézier as well as the beginning-to-end tangent, to make the tangent function injective on the length of each segment.
- `doTheTask(c1, c2)` creates a patch or two from two Bézier curves, which will either be the “parallelogram” formed by translating them, or will include their convolution if it exists.

Uhhh … I allowed the control points to go backwards ($\delta_0 \delta_1 < 0$) and I perturbed the tangential splits due to numeric issues. The latter could be fixed by actually remembering those splits and not trying to solve $q(p)$ there.

Throughout this post, Iʼve been emphasizing “the pen nib this” and “the pen path that” for the sake of giving you a concrete image in your mind. But the reality of the underlying math is that there is no fundamental distinction between the two curves. The convolution and Minkowski sum are commutative, and do not care which curve is which.

My primary sources/inspiration:

- The wonderful, fabulous, extraordinary Primer on Bézier Curves by Pomax.
- Fitting cubic Bézier curves, How long is that Bézier?, and other posts by the excellent Raph Levien.
- Linked from the above, my main reference that convinced me it was possible and made my life easier (except for the hour I spent chasing down a minus sign I forgot): High accuracy geometric Hermite interpolation by Carl de Boor, Klaus Höllig, and Malcolm Sabin.

Other miscellanea on Bézier curves:

- The Uniqueness of the Rational Bézier Polygon is Unique by Javier Sánchez-Reyes
- The problem with cubic bezier curvature and related StackExchange post
- Special cases to arc length reparameterization of a cubic Bézier
- the term for the third-derivative analog of curvature for curves is “aberrancy”
- Bézier curvature extrema
- … can time-parameterization cut down on the configuration space of cubic curves too?

The interactive widgets here will allow you to build and verify your intuition by clicking through examples, because I believe that once you are armed with the basic ideas and the right intuition, you can figure out the rest of the details for yourself.

Alternatively, it can serve as a playground to test out hypotheses about grammars, see exactly where things go wrong when conflicts occur, and what to do to fix those errors.^{2}

My real goal is to show you how intuitive LR(1) parsing can be! Thatʼs right, thereʼs nothing too advanced or magical going on, just some intuitive ideas carried to their logical conclusions. I used to be scared of parser generators, but once I was introduced to them I followed this approach to understand them for myself and I hope to share that with you.

As (functional) programmers, weʼre used to learning topics in terms of the appropriate datatypes and operations for the job, and thatʼs what I will go through for you here. Hint: weʼll use a lot of monoids!

```
data ShiftReduce s r
  = Shift s
  | Reduces (NonEmptyArray r)
  | ShiftReduces s (NonEmptyArray r)

instance Semigroup (ShiftReduce s r) where
  -- We do not expect to see two shifts, so arbitrarily prefer the first one
  append (Shift s) (Shift _) = Shift s
  append (Shift s) (ShiftReduces _ rs) = ShiftReduces s rs
  append (ShiftReduces s rs) (Shift _) = ShiftReduces s rs
  append (ShiftReduces s rs) (ShiftReduces _ rs') = ShiftReduces s (rs <> rs')
  append (Shift s) (Reduces rs) = ShiftReduces s rs
  append (Reduces rs) (Shift s) = ShiftReduces s rs
  append (Reduces rs) (Reduces rs') = Reduces (rs <> rs')
  append (ShiftReduces s rs) (Reduces rs') = ShiftReduces s (rs <> rs')
  append (Reduces rs) (ShiftReduces s rs') = ShiftReduces s (rs <> rs')
```

Thatʼs my big complaint: thereʼs too many numbers floating around in traditional explanations of LR(1) tables, without any indication of what they mean or how they tie together. So I have used semantic formatting to indicate what they all mean: numbered states 0, 1, 2 are highlighted differently from named rules 0, 1, 2 and differently from number tokens 0, 1, 2. Hopefully the pretty colors will keep your attention!

All of the mechanical steps in generating LR(1) parsers will be broken down and their motivation explained. What does it mean to “close” a state? How do you know what states are next, and when are you done? Why are “reduce–reduce” conflicts bad? Stay tuned to find out!

Skip to Widgets below to start using! Or click this button to see them on their own:

Right now I donʼt have much helpful content below, but I will slowly add more posts. I am finally ready to start writing explanatory content now, after working a lot on the behind-the-scenes code to animate it.

EDIT: this project kind of stalled, sorry. Iʼm still thinking about parsers a lot, but not in this introductory way. You can bug or help me write more ^^

- TODO: Using this tool by example
- TODO: Terminology reference
- WIP: Basics: What are grammars (BNF, RegExp)
- Nonterminals and terminals
- Sequencing and alternation (regexes)

- WIP: Uses of grammars:
- Generators: nondeterministically generate strings in the grammar by following the rules as state transitions
- Recognition: recognize which strings belong to the grammar and which do not
- Syntax highlighting: cursed.
- Parsing: find an unambiguous parse tree for inputs that belong to the grammar

- WIP: Basics of LR(1) Parsing
- States
- State transitions
- Closure of states
- Lookahead

- Precedence
- Refresher on operator precedence
- Operator precedence mapped to LR(1) table parsing
- Conflict resolution using precedence operators à la Happy.

- Grammars as datatypes
- AST/CSTs as ADTs
- Data associated with tokens
- Finding perfect representations, e.g. no leading zeroes, if you want it to encode data exactly
- Common practice of using grammars this way (e.g. in type theory papers)

- State exploration: current state, next states, previous states
- Emulate Happy, especially precedence operators
- Can Happy precedence operators be explained in terms of alternative grammars? Can the conflict resolutions always be “pulled back” to a grammar that would generate the same table, or a larger table that represents the same abstract grammar? Does it require quotiented grammars to explain?
- Generate example inputs for each state, especially to diagnose conflicts^{3}
- Explain Earley parsing using a similar approach
- Better errors!

Craft a grammar out of a set of rules. Each rule consists of a nonterminal name, then a colon followed by a sequence of nonterminals and terminals. Each rule must also have a unique name, which will be used to refer to it during parsing and when displaying parse trees. If omitted, an automatic rule name will be supplied based on the nonterminal name.

The top rule is controlled by the upper input boxes (LR(1) grammars often require a top rule that contains a unique terminal to terminate each input), while the lower input boxes are for adding a new rule. The nonterminals are automatically parsed out of the entered rule, which is otherwise assumed to consist of terminals.

Click “Use grammar” to see the current set of rules in action! It may take a few seconds, depending on the size of the grammar and how many states it produces.

This will randomly generate some inputs that conform to the grammar. Click on one to send it to be tested down below!

Each rule can be read as a transition: “this nonterminal may be replaced with this sequence of terminals and nonterminals”. Build a tree by following these state transitions, and when it consists of only terminals, send it off to be parsed below!

Text entered here (which may also be generated by other widgets below) will be parsed step-by-step, and the final parse tree displayed if the parse succeeded. (Note that the closing terminal is automatically appended, if necessary.) Check the state tables above to see what state the current input ends up in, and the valid next terminals will be highlighted for entry.

To construct the LR(1) parse table, the possible states are enumerated. Each state represents partial progress of some rules in the grammar. The center dot “•” represents the dividing line between already-parsed and about-to-be-parsed.

Each state starts from a few seed rules, which are then closed by adding all nonterminals that could be parsed next. Then new states are explored by advancing on terminals or nonterminals, each of which generates some new seed items. That is, if multiple rules will advance on the same (non)terminal, they will collectively form the seed items for a state. (This state may have been recorded already, in which case nothing further is done.)

When a full rule is parsed, it is eligible to be reduced, but this is only done when one of its lookaheads comes next (highlighted in red).

Once the states are enumerated, the table of parse actions can be read off:

Terminals can be “shifted” onto the stack, transitioning to a new state seeded by pushing through that terminal in all applicable rules in the current state.

Completely parsed rules will be “reduced” when their lookahead appears, popping the values matching the rule off of the stack and replacing it with the corresponding nonterminal, which then is received by the last state not involved in the rule.

Nonterminals received from another state trigger “gotos” to indicate the next state.

Two types of conflicts may occur: a terminal may indicate both a shift and a reduce action (a shift–reduce conflict) or multiple reduce actions (a reduce–reduce conflict). Note that there cannot be multiple shift actions at once, so most implementations (including this one) choose to do the shift action in the case of a shift–reduce conflict.

This is a post on constructing real numbers *without* constructing rational numbers. Along the way the rationals will sort of be constructed, or at least heavily implied, but they donʼt directly figure into the definition! Instead all we need is functions from integers to integers.

The construction weʼll be talking about is my favorite esoteric construction of real numbers: the *Eudoxus real numbers*.

I want to explore *why* this definition works, not merely recite its definition and properties. So letʼs go on a journey together through various properties of linear and almost linear functions, and learn how to leverage “wrong” definitions of slope to create the real numbers!

*If you donʼt have any background in real analysis or constructions of the real numbers, thatʼs okay. Bring along your intuitions nonetheless, they will aid you greatly! We will go over the necessary details as we get to them, although you may miss some of the depth of the discussion when I reference those other things. Just know that thereʼs a lot of strange and marvelous things going on that make constructing the real numbers an essential part of mathematics. Are these marvelous properties of the real numbers all coincidences? Surely not* 😉.

Like all constructions of fundamental objects, we need to begin with our existing intuition about what real numbers are in order to construct them out of more basic building blocks. Just like the integers are built from natural numbers, and rational numbers are built from integers, real numbers need to be constructed from natural numbers, integers, and/or rationals. In the case of real numbers, we will need two ingredients: approximations, and a lot of them – infinitely many, in fact.

The key property of approximations is that they need to be *easier to grasp* than the thing we are approximating, otherwise it doesnʼt buy us any leverage! In our circumstance in particular, each individual approximation should contain a *finite amount of data*, and then we stitch together an *infinite number of approximations* to represent the complicated object we are constructing: a real number.

A convenient way to say that something has a finite amount of data is to say it is countable. In programmer terms, we could say we want it to be serializable into a string – itʼs equivalent. Why? Well each natural number or string has a finite amount of data, it is easy to understand or describe all at once. So if some set we are studying is countable/serializable, meaning it can be understood as if it was a set of natural numbers, then it can serve as an approximation for our purposes.

Thereʼs much more to say about this kind of process, indeed I hope to write a whole nother blog post about it, but for now know that there are two main questions to ask:

- When are approximations consistent enough that they ought to represent *a thing* (e.g. “when does a sequence of rationals converge?”), and
- When do two approximations approximate the *same* thing (e.g. “when do two convergent sequences converge to the same point?”).

For example, letʼs think about approximation generators, functions that generate approximations by taking in an error bound and returning a rational number that lies within that error bound of the real number we are trying to approximate.^{1} Because we can ask for smaller and smaller error bounds, we in fact get an infinite number of approximations that probe closer and closer to the real number. This is good: when approximating an irrational number, no single rational approximation will suffice to capture it! But not all functions will be well-behaved approximation generators, and even of those that are, there will be many ways of approximating the same number. Thatʼs why we ask those two questions: when do the approximations work, and when do they agree with each other.
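To make this concrete, here is a toy approximation generator for $\sqrt{2}$ (my own example, deliberately using only integer arithmetic so it doesnʼt sneak in real numbers):

```javascript
// Given an error bound eps, return a rational approximation [num, den]
// with |num/den - sqrt(2)| < eps, using only integer arithmetic.
function approxSqrt2(eps) {
  const den = Math.ceil(1 / eps);
  let num = 0;
  // Largest num with num^2 <= 2 den^2, so num/den <= sqrt(2) < (num+1)/den,
  // putting num/den within 1/den <= eps of sqrt(2).
  while ((num + 1) * (num + 1) <= 2 * den * den) num++;
  return [num, den];
}
```

Asking for smaller and smaller `eps` yields an infinite family of rationals probing closer and closer to $\sqrt{2}$ – no single one of them suffices, but together they pin it down.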

Answering these two questions will let us invent real numbers. However, the process to get there wonʼt quite be *linear*: the notion of approximation will not be as clear as it was in the example above!

Letʼs dive in!

The idea of Eudoxus real numbers is that they represent a real number indirectly, via a functionʼs slope. We will cover *why* this works later, but first letʼs agree on *what slope is* so we can proceed with the construction and then analyze its properties that make it all work.

Letʼs say we have a function that takes in numbers and outputs numbers. (We donʼt need to be formal about what kind of numbers yet.)

What **definitely** has a slope? Well our favorite function from school mathematics does: $f(x) = m*x + b$. Weʼve memorized that this has slope $m$!

Why do we say it has slope $m$? What is slope? Well, weʼve also memorized that slope is “Rise Over Run”. If we take two points at $x_1$ and $x_2$, the distance we **run** from the first to the second point is $x_2 - x_1$, and the distance we had to **rise** to keep up with the function is $f(x_2) - f(x_1)$. Slope as “rise over run” is therefore $\frac{f(x_2) - f(x_1)}{x_2 - x_1}.$

The beautiful thing about $f(x) = m*x + b$ is that functions of this form have *constant slope*, no matter what $x_1$ and $x_2$ are! $\begin{align*}\frac{f(x_2) - f(x_1)}{x_2 - x_1} &= \frac{(m*x_2 + b) - (m*x_1 + b)}{x_2 - x_1} \\&= \frac{(m*x_2 - m*x_1) + \cancel{(b - b)}}{x_2 - x_1} \\&= \frac{m*\cancel{(x_2 - x_1)}}{\cancel{x_2 - x_1}} \\&= m.\end{align*}$

The inputs $x_1$ and $x_2$ disappear, meaning it traces out a line with a constant slope – itʼs why we call it a linear function. So weʼre pretty happy saying that this function has slope $m$. Tada! 🎉

This is where we have to ask what kind of numbers we are using, because that determines what $m$ can be. If the function takes in real numbers and outputs real numbers, $m$ can surely be any real number too. But tough luck – we canʼt use that to *construct* real numbers, no circular definitions allowed!

Maybe $m$ can be any rational number. Thatʼs fine – slope works in the same way. But we run into another roadblock: the function still only has one well-defined slope, and itʼs stuck being a rational number. Thereʼs no magic of infinity happening.

What if we say that $f$ is an integer function? Here we have to be careful: integer division isnʼt necessarily well-defined, **but** if we know $m$ is an integer, then it happens to work out in this case: the denominator will always evenly divide the numerator, and out pops $m$. This seems like an even worse situation, getting stuck with integers! But wait …

We need a little wiggle room. Having a constant slope is too restrictive! Can we loosen up and find something that is still like slope, but only approximately?

Having hit these roadblocks, we need some inspiration. Letʼs revisit the idea of integer functions but in a new way.

Hereʼs the thing: consider a graphing calculator, or a graphing app on a computer. What happens when we ask it to graph $f(x) = (1/\pi) * x$? (Pick your favorite irrational-but-computable real number there.)

It does some math, and shows us a line on a screen … but wait, that line isnʼt a perfect mathematical function from real numbers to real numbers, itʼs a pixelated entity: *itʼs an integer function that approximates the real function*.

What happens as we keep scrolling the screen? We could keep sampling snapshots of the screen at each moment, and see how they vary.

If we had picked an integer slope, every snapshot would appear to have the same slope. For a slope of $0$, there is never any change. For a slope of $1$, it steadily moves up pixel by pixel. For a slope of $2$, it skips a pixel as it goes. Et cetera. This isnʼt so interesting, yet!

But if we had picked a rational slope, we start to see something interesting happen: It doesnʼt always step by the same amount, sometimes it jumps more than it does other times.

For instance, a slope of $1/7$ would only jump up a single pixel every $7$ pixels. A slope of $4/3$ would jump up $2$ pixels one time, then jump up a single pixel twice more, making visible groups of three.
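We can simulate those pixel columns with a little Python sketch (the helper name `pixels` and the floor-rounding convention are my own illustration of what a graphing calculator does):

```python
from fractions import Fraction
from math import floor

def pixels(m, xs):
    """One pixel height per column x, for the pixelated line y = m*x."""
    return [floor(m * x) for x in xs]

# Slope 1/7: the line climbs a single pixel once every 7 columns.
print(pixels(Fraction(1, 7), range(15)))

# Slope 4/3: jumps of 1, 1, 2 repeat, making visible groups of three.
heights = pixels(Fraction(4, 3), range(10))
jumps = [b - a for a, b in zip(heights, heights[1:])]
print(jumps)  # [1, 1, 2, 1, 1, 2, 1, 1, 2]
```

(Using `Fraction` avoids floating-point rounding quietly shifting a pixel.)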

More complicated fractions will look more complicated and repeat over longer distances. Can you figure out what the third line graphed in this figure is? (Hint: how long does it take to repeat?)

However, this same phenomenon occurs for irrational slopes too: how do we tell the difference now??

The key is not how the jumps repeat in individual snapshots, but how they repeat across the whole domain of the function.

Do the snapshots recur in a nice pattern? If so, that must be graphing a line with rational slope! They will recur when the window has shifted by the denominator of the slope.

*If the snapshots recur irregularly, without fitting a nice pattern, it is graphing an irrational number.*

(Since there are a finite number of snapshots that will be generated for a particular line, they must recur – thatʼs not the point. The point is whether they recur on a fixed schedule.)

This is where we cross over from computation to mathematics: We canʼt hope to decide whether this is the case from a finite number of snapshots! Instead we leave it to the realm of mathematical proof to definitively establish whether a slope is rational or irrational. (We might not know!)

Okay, we are ready to take the knowledge we learned from graphing calculators and translate it back into mathematics.

We agreed that linear functions of all types have slope, but only linear functions using real numbers contained enough data to represent real numbers. However, we saw that calculators approximate real functions with integer functions – this is our hope of salvation, if we can connect the two ideas.

What, then, will tell us whether some **integer** function approximates a **real linear** function?

We said that integer-function approximations to lines with rational slopes had a key property: they recur in a consistent pattern!

If we call the period of recurrence $p$, this property is that $f(x + p) = f(x) + d$, for some $d$.

What is this mysterious $d$ though? Itʼs the **rise** for the **run** $p$. Weʼre saying that the function repeats every time we shift by $p$ steps of the $x$-axis, but it has shifted by $d$ in the $y$-axis in the meantime.

Thatʼs right, weʼve circled back to good olʼ slope, but in a new way: instead of saying that we want the function to be linear by having rise over run being constant for all $x_1, x_2$ we choose, we just want it to be constant when $x_2 - x_1$ is a known period.

This is much less restrictive: in particular, it lets us get rational slopes back out of integer functions! If we define the slope to be $d/p$, we see that for integer approximations it matches the slope of the original rational linear function, which we said would recur with period of the denominator, and of course it must travel vertically by the numerator in that time.
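Here is a small check of this recurrence property, using a floor-rounded approximation of the line with slope $3/7$ (the floor-rounding representative is my own illustration):

```python
from fractions import Fraction
from math import floor

def f(x):
    """Pixelated line with rational slope 3/7 (floor rounding)."""
    return floor(Fraction(3, 7) * x)

p, d = 7, 3  # period p = denominator, rise d = numerator

# The approximation recurs: shifting by p always raises the graph by d.
assert all(f(x + p) == f(x) + d for x in range(-100, 100))
print("recovered slope:", Fraction(d, p))
```

The slope $d/p = 3/7$ pops back out of the integer function, exactly as described.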

With how they wave around, these two functions donʼt look linear anymore, but they do follow a nice recurrence. Their shapes keep repeating every $4$ or $8$ pixels, while they climb or fall by half that each time. That is, they have slope $+1/2$ and $-1/2$ in our new system, even though they are not linear.

We hit a roadblock with irrational slopes, though. If we donʼt have a period to work with, how can we get our hands on a slope from an integer function??

The answer is … [drumroll] … **we donʼt!!!!**

Huh?!?! 🤨😧😱

What. Donʼt give me that look. 😄😌😇

Look here: we said linear functions were boooring. Linear functions were just their slope! pluuus a second term that merely shifts it vertically, as if that matters. Yawn 😴😪.

However, now weʼre going to use this correspondence as leverage to do something *truly* exciting: represent slopes **as the functions weʼre given!**

We donʼt need an independent notion of slope if we can isolate the properties that *arenʼt* slope in order to distill the functions down into their slopiest essence. 🍾✨

As we just said, shifting by a constant amount vertically will not impact slope. However, we can do infinitely better than this.

Recall that weʼre looking at integer functions through their whole domain, not just at a single slice anymore.

Say the calculator made an error and calculated a pixel incorrectly. (The horror! Quick – letʼs blame it on cosmic radiation.)

Should this *single error* affect the slope? Nope! Itʼll quickly get washed away in the *infinite other data* that informs us what the slope is.

Should another error affect the slope? Nahhh. Should any finite number of errors affect the slope? Not even!

Thus we can say that shifting vertically doesnʼt influence the slope and any finite number of errors wonʼt influence it either.

Putting these together, we can wrap them up as a single property, and with a little extra sprinkle on top: *any two functions that differ by a bounded amount represent the same slope!*

Finally we get to identify what slope actually is. Took us long enough!

Hereʼs the key property: we want to ask that the quantity $d_f(x_1, x_2) = f(x_1) + f(x_2) - f(x_1 + x_2)$ is *bounded* for the integer functions we are considering.

The nLab calls this property “almost linear”. Letʼs also call $d_f(x_1, x_2)$ the “wiggle” of the function $f$ between $x_1$ and $x_2$.

As a first check, letʼs see that this happens for actually linear functions $f(x) = m*x + b$: $\begin{align*}&\ f(x_1) + f(x_2) - f(x_1 + x_2) \\=&\ (m*x_1 + b) + (m*x_2 + b) - f(x_1 + x_2) \\=&\ (m*(x_1+x_2) + 2*b) - f(x_1 + x_2) \\=&\ (\cancel{m*(x_1+x_2)} + 2*b) - (\cancel{m*(x_1 + x_2)} + b) \\=&\ b.\end{align*}$

Whoa, itʼs not only bounded, itʼs actually constant! And the constant is the vertical shift we said we didnʼt care about, interesting.

What about for integer approximations of linear functions?

First we can look at the linearly-periodic approximations of rational linear functions: saying the period is $p$, we mean that $f(x + p) = f(x) + d$ for all $x$. So as one example, if we pick $x_2 = p$, then for any $x_1$, the quantity weʼre looking at $f(x_1) + f(p) - f(x_1 + p)$ is just $f(p) - d = f(0)$. Constant, thus bounded. (Note how nicely this matches up with the perfectly linear case, since $f(0) = b$ then!)

We can play this game again: $f(x_1) + f(2*p) - f(x_1 + 2*p) = f(2*p) - 2*d = f(0)$ as well.

But what about other values of $x_2$ that donʼt fit into the period? The key is that $f(x_1) + f(x_2) - f(x_1 + x_2)$ is not just linearly-periodic, but actually periodic in both $x_1$ and $x_2$ now, having the same period $p$ of course:

$\begin{align*}&\ f(x_1) + f(x_2 + p) - f(x_1 + (x_2 + p)) \\=&\ f(x_1) + (f(x_2) + d) - (f(x_1 + x_2) + d) \\=&\ f(x_1) + f(x_2) - f(x_1 + x_2).\end{align*}$

The $d$ cancels out from the two terms that changed.

The upshot is that we donʼt need to worry about *all* values of $x_2$, just one period worth of it, from $1$ to $p$. Explicitly, the wiggle of the function is bounded by the quantity $\max(|f(1)|, |f(2)|, \dots, |f(p)|).$

We didnʼt identify a property for abstractly looking at approximations of linear functions with irrational slopes, like we did with approximations of rational slopes. However, if we look at a typical approximation function, we know that, with whatever integer rounding it is using, its integer value will be within $\pm 1$ of the actual real value at any given point. So when you look at the three terms $f(x_1)$, $f(x_2)$, and $f(x_1 + x_2)$, it is clear that the wiggle will be within $\pm 3$ of the value of the wiggle of the underlying linear function, which we know is constantly $b$. So the wiggle is clearly bounded.
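We can watch this boundedness happen numerically. The sketch below (my own illustration, using floor rounding for the irrational slope $\sqrt{2}$) computes the wiggle over many pairs of points:

```python
from math import floor, sqrt

def f(x):
    """Floor-rounded approximation of the line with irrational slope sqrt(2)."""
    return floor(sqrt(2) * x)

def wiggle(f, x1, x2):
    """The quantity d_f(x1, x2) = f(x1) + f(x2) - f(x1 + x2)."""
    return f(x1) + f(x2) - f(x1 + x2)

ws = {wiggle(f, a, b) for a in range(-100, 100) for b in range(-100, 100)}
print(sorted(ws))  # a small bounded set: floor rounding keeps it in {-1, 0}
```

The wiggle never leaves a tiny set of values, even though $f$ itself grows without bound.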

Having satisfied our curiosity that we have identified the necessary ingredients, letʼs finally define the actual mathematical object and the operations on it.

The **Eudoxus real numbers** are **almost linear functions** but where two functions are considered the same when they have a **bounded difference**.

- **Almost linear functions** have bounded wiggle, which is our name for the quantity $d_f(x_1, x_2) = f(x_1) + f(x_2) - f(x_1 + x_2)$. This property should make it possible to say this function has a slope!
- **Bounded difference** just means that $f(x) - g(x)$ is bounded. This property eliminates the data each function carries that is extraneous to our goal of defining slopes.

We can add slopes in the obvious way, by adding the functions together (pointwise!): $(f+g)(x) = f(x) + g(x)$. And we can negate pointwise $(-f)(x) = -f(x)$, and the zero for this is of course the constantly zero function: $0(x) = 0$. For example, we can add slope $1/6$ to slope $-1/2$ to get slope $-1/3$:

Huh, that middle line doesnʼt look like the graph of the line with slope $-1/3$! Itʼs very flakey, like it almost canʼt decide whether it wants to go up or down. Still, it looks like it does have the right trend overall, a general slope of $-1/3$, and we will prove this later: addition works out just fine, within the bounded wiggle we expect!
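We can verify that overall trend numerically. This sketch (floor-rounded representatives are my own illustration) adds the two representatives pointwise and reads off the long-run slope:

```python
from fractions import Fraction
from math import floor

def rep(m):
    """A representative almost-linear function for slope m (floor rounding)."""
    return lambda x: floor(m * x)

f = rep(Fraction(1, 6))
g = rep(Fraction(-1, 2))
h = lambda x: f(x) + g(x)  # pointwise addition of representatives

# The sum's long-run trend h(n)/n recovers the slope 1/6 + (-1/2) = -1/3:
print(Fraction(h(6000), 6000))
```

Despite the flakey up-and-down behavior pixel by pixel, the ratio $h(n)/n$ settles on $-1/3$.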

Now we get to my favorite part: how to multiply slopes.

Multiplying the functions pointwise is the wrong move: that would produce something like quadratic functions, or worse.

That is a sad looking parabola indeed ☹️.

Hmm … oh wait! What about *composing* functions?

If you compose two functions, you multiply their slopes! So we have $(f*g)(x) = f(g(x))$. This suggests that the identity function acts as the identity for this multiplication, $1(x) = x$ (of course it has slope $1$!).
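Hereʼs a quick numeric sanity check of multiplication-as-composition (again with my illustrative floor-rounded representatives):

```python
from fractions import Fraction
from math import floor

def rep(m):
    """A representative almost-linear function for slope m (floor rounding)."""
    return lambda x: floor(m * x)

f = rep(Fraction(2, 3))
g = rep(Fraction(3, 5))
fg = lambda x: f(g(x))  # multiplication of slopes = composition of functions

# The composite's long-run trend recovers the product (2/3)*(3/5) = 2/5:
n = 10**6
print(Fraction(fg(n), n))
```

Composing the two pixelated lines really does trace out a pixelated line with the product slope.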

The goal is to come up with an ordering on almost-linear functions that is consistent with the arithmetic and the equivalence relation. So we can take a shortcut: we only need to define what it means to be positive, then $f > g$ is defined by the difference $f - g$ being positive.

Before we ask what it means to be positive or negative, what does it mean to be zero? We said that zero was the constant function – but wait, itʼs actually the *equivalence class* of the constant function, which consists of all functions that are bounded (since that means they have bounded difference from zero).

Positive and negative, then, are going to be unbounded functions. How do we distinguish them?

The basic idea is that positive will grow unboundedly *positive* to the right and negative will grow unboundedly *negative* to the right. In formal terms, we say that $f$ is positive if, upon picking some arbitrary height $C$, it is eventually always above $C$ as it goes towards the right: i.e. there is some $N$ such that for all $m > N$, $f(m) > C$.
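We can probe this condition computationally, though only as a heuristic: a finite scan can suggest positivity but never prove it (the helper below, its name, and its cutoffs are all my own illustration):

```python
from math import floor, sqrt

def looks_positive(f, C=100, N=1000, horizon=100000):
    """Heuristic check of the positivity condition: sample whether
    f(m) > C for all m with N < m < horizon. A real proof needs
    mathematics -- no finite check can decide this!"""
    return all(f(m) > C for m in range(N + 1, horizon))

pos = lambda x: floor(sqrt(2) * x)   # slope sqrt(2) > 0
neg = lambda x: floor(-sqrt(2) * x)  # slope -sqrt(2) < 0
print(looks_positive(pos), looks_positive(neg))
```

The positive-slope representative clears any bar $C$ eventually; the negative-slope one dives below it instead.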

There are some properties to verify.

First off, why is it sufficient to just consider the behavior of the function to the right (i.e. for $m > 0$)? We would need to know that growing unboundedly positive to the right is the same as growing unboundedly negative to the left. This conforms with our intuition of how almost linear functions have slope, but it requires a proof.

The other properties of how the ordering interacts with arithmetic are straightforward, however:

- Negation works as expected: $-f > -g$ holds exactly when $g > f$ holds, since the first means $(-f) - (-g)$ is positive, and the second means $g - f$ is positive.
- Addition works as expected: $f_1 + f_2 > g_1 + g_2$ if $f_1 > g_1$ and $f_2 > g_2$, since $(f_1 + f_2) - (g_1 + g_2) = (f_1 - g_1) + (f_2 - g_2)$, and obviously the sum of two positive functions is still positive.
- Multiplying by a positive factor works as expected: $f * h > g * h$ if $f > g$ and $h > 0$ holds by distributivity $f * h - g * h = (f - g) * h$ and multiplication of positives is obvious too.

The million dollar question: does it really have a slope?

To determine this, weʼll basically use Cauchy real numbers. Recall that those are equivalence classes of sequences of rational numbers. Can we construct a Cauchy real number from *our* definition of real numbers, to capture the slope of our almost-linear functions?

Recall our definition of slope as rise over run. We should be able to pick two arbitrary points and compute their slope, right? It will be a rational number, since weʼre dealing with integer functions. And then presumably as those points get further and further away from each other, the approximation will get better and better.

We might as well pick $0$ as our basepoint and move out to positive numbers, like so: $\lim_{n \to \infty} \frac{f(n) - f(0)}{n - 0}.$

Except notice that $f(0)$ is a constant in the numerator, and since the denominator grows to $\infty$, some basic properties of limits tell us that it is irrelevant, so we can instead use this limit: $\lim_{n \to \infty} f(n)/n.$

In fact this realization is the key step that informs us that the limit is well-defined on equivalence classes: any bounded difference between two functions will not affect this limit. So we could also assume that $f(0) = 0$ without loss of generality, since that will be in the same equivalence class.
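Hereʼs a numeric taste of both claims: the ratio $f(n)/n$ homes in on the slope, and a bounded perturbation of $f$ doesnʼt change where it homes in (the perturbation `g` is my own illustration):

```python
from math import floor, pi

f = lambda x: floor(x / pi)   # approximates the line with slope 1/pi
g = lambda x: f(x) + (x % 5)  # differs from f by a bounded amount (< 5)

# Both sequences f(n)/n and g(n)/n head to the same slope 1/pi = 0.3183...
for n in (10, 1000, 100000):
    print(f(n) / n, g(n) / n)
```

The bounded difference between `f` and `g` gets divided by $n$ and washes out in the limit.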

Now we just need to show that it satisfies the Cauchy condition, assuming $f$ is almost linear:

$\forall \epsilon > 0, \exists N > 0, \forall n, m > N, \\ \left|f(n)/n - f(m)/m\right| < \epsilon.$

We will go through what this means including a proof later, since it requires more machinery.

But weʼve half-answered our question already: we have done what we can to isolate slope from the other data the functions carry, and it just remains to confirm that we can in fact define slope from it.

Thereʼs another obvious property that we need slopes to be invariant under:

Weʼve covered vertical shifting, but what about horizontal shifting? Do bounded (vertical) differences suffice to cover bounded (horizontal) differences? Obviously not for arbitrary functions, but hopefully for the almost-linear functions we are covering?

It turns out yes: being almost-linear means exactly that horizontal shifts of the function require just bounded vertical shifts to compensate. We need $f(n) - f(n+c)$ bounded (for fixed $c$), but being almost linear means that $f(c) + f(n) - f(n+c)$ is bounded, which differs just by the constant $f(c)$.

For example, these two functions with slope $3/7$ are shifted by $6$ relative to each other, so their difference forms a pretty checkerboard-like pattern, which is clearly bounded, only taking the values $2$ and $3$ as the functions grow in close proportion.

Thereʼs one property we have not mentioned yet: monotonicity.

All of the linear approximation functions we would come up with are monotonic: either for all $m \le n$, $f(m) \le f(n)$ (weakly monotonically increasing), or for all $m \le n$, $f(m) \ge f(n)$ (weakly monotonically decreasing). But weʼve never mentioned this. Why not include it as a property we care about having or respecting?

The first clue is what we saw above with addition: when we added two monotonic representative functions (with slope $1/6$ and $-1/2$), the result wasnʼt monotonic anymore, it went up and down and up and down, although in a nicely repeating pattern. So our naïve definition of addition did not preserve monotonicity.^{4}

However, you can in fact take any representative function and make it monotonic by only a bounded difference – but with one big catch: you have to know up front whether the slope is zero, positive, or negative – and that property is undecidable.

So it just seems more fussy and complicated to require the representative functions be monotonic, even though it could be done without resulting in a different theory.

All of the lemmas involving addition are straightforward, so I wonʼt go through them here, but feel free to ask me if you have questions!

This is my **favorite** theorem, and the reason I like this construction so much.

Distributivity is what lets you pull out common factors, like so: $(f+g)*h = f*h + g*h$.

Watch what happens when we use our definitions of these operators (recall that multiplication is function composition):

$((f+g)*h)(x) = (f+g)(h(x)) = f(h(x)) + g(h(x)).$

Thatʼs right, it falls out for free based on how we defined multiplication and addition, just because of how functions work! Isnʼt that so cool?

But … thereʼs a catch.

Proving that multiplication *on the left* distributes over addition is not so straightforward: $(h*(f+g))(x) = h((f+g)(x)) = h(f(x) + g(x)) =\ ??$

There doesnʼt seem to be a direct way to attack this. However, once we prove that multiplication is commutative, it doesnʼt matter anymore. (In fact, I suspect that proving left-distributivity directly requires similar arguments to the commutativity of multiplication anyways.)

Thereʼs two aspects to show that the multiplication is well-defined: since weʼre dealing with subquotients, we need to show that the result satisfies the necessary property, and also that it respects the quotienting relation.

I think this first aspect requires one of the more tricky proofs: $f*g$ is almost linear if $f$ and $g$ are. (Remember we defined multiplication as the composition of those functions, not pointwise multiplication!)

Assume $f(m+n) - f(m) - f(n)$ and $g(m+n) - g(m) - g(n)$ are bounded, then show that $f(g(m+n)) - f(g(m)) - f(g(n))$ is bounded as well.

For the set-up, we observe that $f(a(m,n)+e(m,n)) - f(a(m,n)) - f(e(m,n))$ is bounded (by assumption), and we choose strategic functions $a(m,n)$ and $e(m,n)$.

In particular, if $e(m,n)$ is bounded by $b$, then $f(e(m,n))$ is clearly also bounded, thus so is $f(a(m,n)+e(m,n)) - f(a(m,n))$ – namely, by the maximum of $f$ on the interval $[-b,b]$.

What quantity do we know is bounded? Letʼs choose $e(m,n) = g(m)+g(n)-g(m+n)$.

For the other function we choose $a(m,n) = g(m+n)$, which makes their sum $a(m,n)+e(m,n) = g(m)+g(n)$.

Plugging these in, we see that $f(g(m)+g(n)) - f(g(m+n))$ is bounded. But $f(g(m)+g(n)) - f(g(m)) - f(g(n))$ is also bounded by the first assumption. Thus their difference $f(g(m+n)) - f(g(m)) - f(g(n))$ is bounded. Q.E.D.

We also need that if $f_1 \sim f_2$ and $g_1 \sim g_2$ then $f_1*g_1 \sim f_2*g_2$. We can decompose this into two steps, varying one side at a time: $f_1 \sim f_2 \implies f_1*g \sim f_2*g$ and $g_1 \sim g_2 \implies f*g_1 \sim f*g_2$.

The first is trivial: if $f_1(x) - f_2(x)$ is bounded, of course $f_1(g(x)) - f_2(g(x))$ is also bounded, just by properties of functions!

The second step also makes sense, but is trickier to formalize: if $g_1(x) - g_2(x)$ is bounded, then $f$ being almost linear should preserve this bounded difference. Itʼs not like $f$ can stray too far away so as to make an unbounded difference on closely bounded inputs.

So how do we knead the almost-linear condition into a form that tells us about $f(g_1(x)) - f(g_2(x))$? Well, what it looks like now is: $f(m+n) - f(m) - f(n)\ \text{bounded},$ and to ask about the *difference* of $f$ at certain inputs, we want to pick $m+n = g_1(x)$ and $m = g_2(x)$, which makes $n = g_1(x) - g_2(x)$, giving us: $f(g_1(x)) - f(g_2(x)) - f(g_1(x) - g_2(x))\ \text{bounded}.$ But weʼre done now, since $g_1(x) - g_2(x)$ is bounded, making its image under $f$ bounded, so using the above fact, $f(g_1(x)) - f(g_2(x))$ is really bounded as we wanted.

Following Arthanʼs exposition, we will need some more machinery before we tackle the rest of the proofs. Using the tools of mathematical analysis we will establish further properties that capture the almost-linearity of functions, using the existing property of bounded wiggle.

Let $C$ be a bound for the wiggle of the function; i.e. $d_f(p, q) < C$ for all $p$ and $q$.

The first lemma^{5} we want to establish is the following^{6}: $|f(pq) − pf(q)| < (|p| + 1)C.$

We need to use induction on $p$. This wasnʼt obvious to me at first, but it makes sense in retrospect!

For the base case $p = 0$, it is easy: $|f(0) - 0| = |f(0)| = |f(0) + f(0) - f(0 + 0)| = |d_f(0, 0)| < C.$ However, letʼs rewrite it using the point $q$, to make the next steps of induction clearer: $|f(0)| = |f(0) + f(q) - f(0 + q)| = |d_f(0, q)| < C.$

Now we need to establish that if it holds for $p$, then it holds for $p+1$: $|f(pq) - pf(q)| < (|p| + 1)C \implies\\ |f(pq + q) - pf(q) - f(q)| < (|p| + 2)C.$

How do we get from $f(pq) - pf(q)$ to $f((p+1)q) - (p+1)f(q)$? The difference is $f((p+1)q) - f(pq) - f(q)$: but this is just $-d_f(pq, q)$, so it is also less than $C$ in absolute value.

So this means that each step we take changes it by less than $C$, and weʼre done by induction^{7}. We can write it all out in one step like this: $|d_f(0, q) + d_f(q, q) + d_f(2q, q) + \cdots + d_f(pq, q)| < (|p|+1)C.$

Using that lemma, we get to the next lemma^{8} in a very slick manner: $|pf(q) − qf(p)| < (|p| + |q| + 2)C.$

Our task here is to compare $pf(q)$ and $qf(p)$. The magic of the above lemma is that it compared these values to a *common* value: $f(pq) = f(qp)$. So we have what we need now: $|pf(q) - qf(p)| \le |pf(q) - f(pq)| + |f(qp) - qf(p)| < (|p| + 1 + |q| + 1)C.$

While weʼre here, we need one more lemma^{9} that certifies that $f$ acts like a linear function, approximately. We show that there are some constants $A$ and $B$ such that the following holds for all $p$: $|f(p)| < A|p| + B.$

Notice that this also says that $f$ behaves like a linear function: it cannot grow outside the confines of the linear function $A|p| + B$ (although those $A$ and $B$ may be large numbers, thus providing only loose bounds on $f$).

It comes immediately from the above lemma: take $q = 1$ to get $|pf(1) - f(p)| < (|p| + 3)C$ so with some rearranging we get $f(p)$ by itself and a coefficient for $|p|$: $|f(p)| < (|f(1)| + C)|p| + 3C.$

Thus we can take $A = |f(1)| + C$ and $B = 3C$ to accomplish the task.

Now we *finally* have the tools we need to prove that we can get a slope out of these almost linear functions! Recall that we were going to define slope as $\lim_{n \to \infty} \frac{f(n) - f(0)}{n - 0} = \lim_{n \to \infty} f(n)/n.$

And to show that it converges, we show that it is a Cauchy sequence, meaning that terms in the sequence get arbitrarily close beyond a certain point: $\forall \epsilon > 0, \exists N > 0, \forall n, m > N, \\ \left|f(n)/n - f(m)/m\right| < \epsilon.$

Well, the lemma above gives us exactly what we need, just pretend that $n$ and $m$ stand for $p$ and $q$: $\left\vert\frac{f(n)}{n} - \frac{f(m)}{m}\right\vert = \frac{|mf(n) - nf(m)|}{mn} < \frac{(m+n+2)C}{mn}.$

Notice how the numerator is like $m+n$ and the denominator is like $m*n$, clearly the denominator will win as $m$ and $n$ get large!

To give more details, notice how we can bound the fraction like so $\frac{(m+n+2)C}{mn} = \frac{C}{n} + \frac{C}{m} + \frac{2C}{mn} < \frac{2C}{N} + \frac{2C}{N^2} < \frac{4C}{N},$ since $m, n > N$ and $N^2 > N$.

So now to get $4C/N$ to be less than $\epsilon$, we need $N > 4C/\epsilon$. So as $\epsilon$ gets really tiny and close to $0$, $N$ has to grow very large in response. But this is fine: it exists by the Archimedean property, completing our proof.
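We can act out this recipe numerically (the constants below are my own illustration: for floor rounding the wiggle lies in $\{-1, 0\}$, so $C = 2$ is a safe strict bound):

```python
from math import ceil, floor, sqrt

f = lambda x: floor(sqrt(2) * x)
C = 2  # bound on the wiggle of f: floor rounding keeps it in {-1, 0}

eps = 1e-3
N = ceil(4 * C / eps)  # the proof's recipe: any N > 4C/eps works
n, m = N + 1, 2 * N
print(abs(f(n) / n - f(m) / m) < eps)
```

Shrinking `eps` makes `N` blow up, exactly as the Archimedean property demands, but the bound always holds.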

Let $f$, $g$ be two almost-linear functions; we need to show that $f*g = g*f$ by showing that $w(p) = f(g(p)) - g(f(p))$ is bounded.

We will use the same trick as above, comparing $f(g(p))$ and $g(f(p))$ to a common term. In fact itʼs a titch more subtle still: we will add extra factors of $p$ first, comparing $pf(g(p))$ and $pg(f(p))$ to the common term $g(p)f(p)$. Do you see where this is going? We get to use our lemmas from above!

Take $q = g(p)$ in our favorite lemma applied to $f$, and $q = f(p)$ applied to $g$, to give us these two inequalities: $|pf(g(p)) − g(p)f(p)| < (|p| + |g(p)| + 2)C$ and $|g(p)f(p) − pg(f(p))| < (|p| + |f(p)| + 2)C.$

Now to get $|pf(g(p)) - pg(f(p))|$ we take the sum, getting a bound of $(2|p| + |f(p)| + |g(p)| + 2)C.$

That kind of looks awful, Iʼm not going to lie, **but** weʼre sooo close. Remember the other lemma that told us that $|f(p)|$ and $|g(p)|$ behaved roughly like $|p|$? We can use that here to say that it *all* behaves like $|p|$:

$(2|p| + |f(p)| + |g(p)| + 2)C < (2+A+A)C|p| + (B + B + 2)C.$

We can squash all that nonsense behind two new constants $D$ and $E$, reaching our next conclusion: $|pf(g(p)) - pg(f(p))| = |p|\,|w(p)| < D|p| + E.$

Take a breath … One last step.

Both sides now behave like $|p|$. If you look really closely, this actually implies that $w(p)$ is bounded: it must sort of behave like $D$ (the slope of $|p|$) on the other side. (It certainly canʼt behave like $|p|$ or $|p|^2$ or anything else that grows faster!)

More specifically, if we have $|p|\,|w(p)| < D|p| + E$, we can divide through by $|p|$ to get $|w(p)| < D + \frac{E}{|p|}.$

This is only valid when $p \ne 0$, however! Since $|p|$ is then a nonzero integer, it has to be at least $1$, so $|w(p)| < D + E$ in that case.

And dealing with the case $p = 0$ on its own poses no problem (it bounds itself), so overall we have $|w(p)| \le \max(D + E, |w(0)|).$

This establishes that $w(p) = f(g(p)) − g(f(p))$ is bounded, and thus multiplication commutes!
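As a final numeric check, we can watch $w(p)$ stay bounded for floor-rounded representatives of two irrational slopes (my own illustration), even as the functions themselves race off toward infinity:

```python
from math import floor, sqrt

f = lambda x: floor(sqrt(2) * x)
g = lambda x: floor(sqrt(3) * x)

# w(p) = f(g(p)) - g(f(p)): bounded, so f*g and g*f are the same slope.
# (Both composites approximate the line with slope sqrt(6).)
w = [f(g(p)) - g(f(p)) for p in range(-500, 500)]
print(min(w), max(w))
```

Both composites hug the same line with slope $\sqrt{6}$, so their difference never escapes a small band.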

Whew! Hopefully you enjoyed our journey through almost-linear integer functions to get to Eudoxus real numbers.

Here are some things I did my best to convey in this article:

- We started with the intuition of how graphing calculators display pixelated lines in order to find a notion of slope that we could derive from integer functions.
- We saw my favorite parts of Eudoxus real numbers: looking at the slope of lines means function composition *creates* multiplication, and we get distributivity for free that way.
- Perhaps you got a glimpse of the non-local nature of dealing with infinity: only the loooong term behavior matters when defining slope, because finite differences and even bounded differences simply disappear in the limit.
- And if you stuck around for the proofs (no shame if not!), you also got to peek at the methods of mathematical analysis, where techniques for dealing with limits and bounding abstract quantities are studied and put to use. (This is often covered in a “Real Analysis” course in colleges and universities.)

Thereʼs a lot still to talk about – like “what does it mean to *construct* a mathematical entity and what tools do we use to do so?”, and we didnʼt even prove that multiplicative inverses exist (what would that look like visually? I bet you have the intuition!) or that the Eudoxus reals are sequentially complete. But this is more than enough for now – the intuition is more important than the details.

For more on Eudoxus reals, including a cool construction of $\pi$ as a Eudoxus real number and references to further connections made with the construction, I would recommend Matt Baker’s Math Blog. And of course see the nLab, where I originally learned about this construction and which I already referenced several times, and Arthanʼs exposition, which helped me fill in the tougher proofs.