Building custom DSLs in TypeScript

Why do we even need custom domain-specific languages?

Building custom DSLs in TypeScript

Why do we even need custom domain-specific languages?

First of all, we should define what do we mean by custom domain-specific languages.

In everyday programming, we have to encode real-world problems in code. We are limited by the power of expressivity of the language we are using. The programming languages are fairly low when it comes to model the real world.

Basically, we translate from a human language to a more basic form of language that a computer can understand.

While translating, and for future reference, we would like to keep the code as close to the real world as possible (easier to get feedback, easier to maintain, to assess).

Domain-specific languages are what is required to fill this gap, your application can be thought of as a series of modules each translating into progressively lower-level concepts so you encapsulate implementation details from business features and keep the code in a maintainable state.

Each of those modules exposes an API that other modules consume, the API is itself a domain-specific language.

Domain-specific languages are challenging, and there are multiple ways to do things (and trade-offs attached).

Usually, we think of a DSL to be tied to a single interpreter but that is not always the case, for example, we can think of a DSL to describe domain objects and out of the description we would like to get things like a serializer, a deserializer, an arbitrary, a comparator and many other things.

Generally, we can think of a DSL as an algebra of operations that can be interpreted in one or many ways.

So how can we build DSLs efficiently? In this blog post, we are going to discuss Final Encodings and Initial Encodings and look at many ways of implementing those.

Final Encoding

Starting from the easy one, Final Encoding means defining the language in terms of its direct interpretation.

Final Encodings are usually efficient but they come at a price, defining the language in terms of its interpretation means not having the possibility of multiple interpretations, they are harder to optimize (sometimes impossible) and many times unsafe (primarily due to stack safety in the js space).

Let's take a base example, we would like to model synchronous failable computations, we could encode the problem using a type like the following:

You can find the code at the following sandbox: Link

In this case, we decided to encode a failable operation with a thunk of code and our language allows for composition of operations by composition of the thunks of each operation.

Separately we can execute the program by calling up the thunk via the run method and the operations contained in the program description (including side effects) will be executed to return a result.

Let's look at the trade-offs, in this case, perf can't get any better and we don't really need multiple interpretations of a program so everything looks good!

Except if we try large recursive procedures like:

And try to compute factorial(100_000) you will get:

Now, that's a dumb way to compute a factorial but in real life, you might encounter frequently problems that will blow up the stack.

Initial Encoding

The idea behind initial encoding is to encode the DSL itself so in the previous example we would like to encode IO in terms of its operations (succeed, fail, suspend, chain, etc).

That is not the easiest thing, before being able to do so we need to take a few steps.

The first concepts we will see are plain and simple ADTs (Algebraic Data Types).

For general reference Algebraic Data Type is a broader term that can refer to types constructed as results of operations on types, those can be A | B, A & B, A ^ B, A / B, A x B, and many more.

For the purpose of this blog and in general, when you hear ADTs we mean unions of types, so things of the form A | B | C | D.

Let's try to encode the following problem: We would like to express basic algebraic operations (add, mul, div, etc) for numeric values.

We can do something like:

Sandbox at: Link

We've represented an Expr in a "free" way, meaning an Expr is the set of operations that is composed of.

We should note that we were able to write a program operation using multiple primitives without ever defining the meaning of the primitives themselves.

Lets now define an interpreter that computes the operation:

And running compute(operation) will result in the computed value.

We can see here that the compute function (the interpreter) has a recursive nature, which means large expressions will fail and blow up the stack.

But that recursion is contained in a single procedure that we could "easily" rewrite in a non-recursive manner making our language safe.

For the purpose of completeness, it is worth taking a look at the steps necessary to transform a general recursive program into an iterative one, the idea is that we will have a series of loops that progressively traverse the operations while preserving a stack of the remaining things to do.

What before was a nice and simple procedure now becomes an awful (but safe) piece of code:

As we can see we are now in full control of how the interpreter b behaves and we can make it safe, with the same idea in mind we will be able to make the previous module (IO) safe to use.

Exercise to the reader: implement print(expression): string that renders the expression like (1 + (3 * (4 / 1)))

Phantom Types & GADTs

We are now able to encode a generic DSL but we haven't introduced generics into the equation, Expr from before was only representing a numeric expression, what if we want an expression that can be numeric or string-based?

Also what if some operations are only relevant to numeric expressions and literal expressions?

We would like to have some form of: type Expr<A> where A can be number | string depending on the type of the expression.

In languages like Scala, we are able to define closed "interfaces" (sealed traits, sealed abstract classes) meaning interfaces implementable only by a specific number of concrete classes (like only in the same file).

In TS we have union types, so we could say:

The problem is A is unused here (phantom), and also there is no information that given a value NumericValue that actually corresponds to Expr<number>.

To simulate this behavior we will add the A parameter to every primitive so NumericValue will become NumericValue<A> and in the definition of NumericValue we will not only require a value of type number but we will also require a function _A: (n: number) => A that converts a number into the generic A.

the function _A will be filled by the constructor function numericValue(value: number): Expr<number> as an identity function.

That ensures A is fixed to number in the context of a NumericValue and will be used in the interpreter to return a generic A when dealing with concrete values of type number.

Let's see the code:

Let's write for simplicity a recursive interpreter, we know from before how to eventually make it stack-safe.

We can see that now compute is fully generic on A and how the identity function _A keeps track of the relationship concrete <> phantom.

Using the above compute(operationStr) will yield a string result, have a look at Link.

Existential Types

There is only one step left to do before being able to fully encode IO<E, A> using an initial representation, using the trick above we could implement almost all the primitives Success, Failure, etc except for Chain.

Chain should contain an operation of type IO<E, A> and a function f: (a: A) => IO<E1, B> and it should represent overall an IO<E | E1, B>.

If we think IO as type IO<E, A> = Success<E, A> | Failure<E, A> | ... | Chain we should have a Chain<E, A>.

There are a few types missing from the equation, those types are known as "Existential" meaning types that only exist within a bounded context in this case the Chain operation.

Existential types are very common when it comes to model continuations (like chain, operations that depend on the result of something else, and a set of functions that specify how to "continue", whatever "continue" means in context).

In order to model these additional parameters, we are going to model the behavior of using Chain instead of modeling Chain itself, meaning we will provide a use function that has to be used to access the content of Chain.

Let's get to code:

As we can see instead of having Chain contain io and f instead it contains a function use that takes a body thunk providing parameters to the body.

To interact with Chain we will use the use function by passing in use(({io, ...}) => do stuff).

Interpreter time, starting with the recursive one:

As we can see to deal with Chain we have to use a function (that is BTW not free perf-wise).

This simple recursive interpreter will have the same issue as the Final Encoding in running large recursive operations.

Also, the types are a bit verbose, on the plus side, everything is truly type-safe.

I personally think in most cases this encoding should be the first to be tried when building the first scratch of your modules and the perf compromise (calling _E, _A, use, etc) is negligible compared to the actual cost of the operations we model, leveraging the type-safety and the compiler help is definitely important in driving the correct implementation of the primitives and the interpreter.

When it comes to the conversion of the above to a non-recursive procedure we will, unfortunately, lose this nice safety.

Let's see how it would look like:

Sandbox at Link

Running the above in sandbox will fail because browsers have limited memory but if you run it in node it will output an insanely large number.

Optimized Variant

Given at the end if we make stack safe interpreters we are bailing out of type safety at the interpreter level we could remove the need for calling _E, _A, use and relax the type safety of primitives using fake functions to instruct the type-system on the correct variance of parameters.

We will end up with something like:

Note how we erased types from the content of primitives and how now Chain no longer needs use.

Note also how we use function types to control the variance of the IO type, in every primitive we inhabit the following:

In this case, both E and A represent outputs so they need to be covariant if we had an input, contravariant, parameter like a context R the computation needs we would inhabit that using:

The code in the article is available in github, try it out using gitpod

Stay updated on our current and upcoming training workshops!

Be social
Be social