Implementing the KotZen parser-combinator

Nick Lydon
6 min readJan 8, 2022

This is a follow-up post to the introduction to KotZen, a parser-combinator library. The project is on github and the docs are hosted on netlify if you want to browse the exported classes and functions. With this post I’ll show a breakdown of how it works from the bottom-up. If you’re really interested, I absolutely recommend that you buy Programming in Haskell by Graham Hutton, a very approachable book that was the basis for this library.

Simply put, a parser is a function that takes a string as input and returns a pair of values — the remaining unconsumed string, and a strongly-typed interpretation of the consumed input, e.g. a number, or boolean, or array: typealias Parser<T> = (String) -> Pair<String, T>. A parser that fails returns null for simplicity, however in the future it could be changed to return more information about how or where it failed.

Repeatedly creating substrings however doesn’t seem efficient, as they will produce more pressure on the garbage collector. As a result I’ve replaced String with a type that represents some “parseable” input:

/**
* Input to the parser
*/
interface Parseable {
/**
* Consumes input by a number of characters
*
@return The remaining input after consumption
*/
fun advance(steps: Int) : Parseable

/**
*
@return The number of characters that remain unparsed
*/
fun remaining() : Int

/**
*
@return The next character
*/
fun head() : Char

/**
*
@return Whether the end of the input has been reached
*/
fun isEmpty() = remaining() == 0

/**
*
@return The unparsed input
*/
fun unparsed() : String
}

/**
* A parser returns the remaining unparsed input and the typed result of parsing the input
*/
typealias ParserOutput<T> = Pair<Parseable, T>

/**
* A parser takes an input, which is parseable, and returns a typed output
*/
typealias Parser<T> = (Parseable) -> ParserOutput<T>?

This allows us to create a class that wraps the whole input string and moves a pointer to the current head of the unconsumed input:

/**
* Wraps a string with a pointer to the head of the remaining unparsed input
* to prevent repeatedly creating substrings
*/
data class StringParseable(private val string: String, private val index : Int = 0) : Parseable {
override fun advance(steps: Int): Parseable = StringParseable(string, index + steps)
override fun remaining(): Int = (string.length - index).coerceAtLeast(0)
override fun head(): Char = string[index]
override fun unparsed(): String = string.substring(index)
}

Next we create a helper that runs the parser on some input. If it succeeds it returns the typed-result. If unconsumed input remains, or the parser fails, it throws an exception:

/**
* Helper function to actually run the parser
*
@param input The string to be parsed
*
@return The typed result of parsing
*
@throws IllegalStateException When unparsed input remains, or cannot be parsed at all
*/
fun <T> Parser<T>.parse(input: String): T {
val output = this(StringParseable(input)) ?: error("Invalid input")

if (!output.first.isEmpty()) error("Unconsumed input: " + output.first.unparsed())

return output.second
}

Note that this is a static “Extension Function” that extends the Parser type. These are similar to extension methods in C# and are used to implement much of KotZen.

Finally we can create our first real parser:

/**
* Parses a character successfully if the given predicate returns true
*
@param fn Predicate that must be satisfied
*/
fun sat(fn: (Char) -> Boolean) = { input: Parseable ->
if (input.isEmpty() || !fn(input.head())) null
else Pair(input.advance(1), input.head())
}

This parser succeeds if the character that’s the head of the input “satisfies” a given predicate function. For example:

sat { it == 'a' }.parse("a")

will return the character ‘a’,

sat { it == 'a' }.parse("b")

fails with the message Invalid input, and finally

sat { it == 'a' }.parse("ab")

fails with the message Unconsumed input: b.

Now we’ve got a real parser (no matter how basic!), we can start to define some combinators that allow us to create more complicated parsers. We’ll start with transforming parser output to a new value:

/**
* Maps a parser output to a new value
*
@param fn A function to transform the parser output
*/
fun <S, T> Parser<S>.map(fn: (S) -> T): Parser<T> = { input: Parseable ->
val output = this(input)
if (output == null) null
else Pair(output.first, fn(output.second))
}

Here’s an example converting the character ‘A’ to its numeric code (65):

sat { it == 'A' }.map { it.code }.parse("A")

Next we’ll go about chaining parsers together, after all it’s not massively useful only being able to parse a single character! We’ll define a combinator that transforms the result of a parser to return a new parser:

/**
* Flattens multiple parsers (called flatMap elsewhere)
*
@param fn Given the result of the current parser, return a new parser
*/
fun <T, S> Parser<T>.bind(fn: (T) -> Parser<S>) = { input: Parseable ->
val output = this(input)
if (output == null) null
else fn(output.second)(output.first)
}

This is similar to map, in that the transformation function given as an argument can return a new type (from T to S), but different in that it actually returns a parser of that type. Here’s an example of parsing the character ‘A’ and then the character ‘B’:

sat { it == 'A' }.bind { first ->
sat { it == 'B' }.map { second -> listOf(first, second) }
}
.parse("AB")

The result is a collection containing two elements: the characters ‘A’ and ‘B’.

It’s also useful to have a parser that always succeeds and returns a value, but doesn’t consume any input:

/**
* Returns a parser that always succeeds.
* The result is the unconsumed input and the value passed to this function
*/
fun <T> pure(default: T) = { input: Parseable ->
Pair(input, default)
}

This can be useful for implementing parsers with optional inputs, where a default value, e.g. 0 or the empty list, can be substituted in.

So far we’ve dealt in certainties — we only handle input where the next character is a single value. Let’s allow for multiple options:

/**
* Returns a parser that returns the first successful result
*
@param alternative The parser to try if this one fails
*
@return The result from this parser, or failing that the alternative
*/
fun <T> Parser<T>.or(alternative: Parser<T>) = { input: Parseable ->
this(input) ?: alternative(input)
}

and it’s usage:

sat { it == 'A' }.or(sat { it == 'B' })

This will successfully handle both strings “A” and “B”.

We can now use this optionality, and ability to return defalt values to produce parsers that repeat themselves:

/**
* Returns a parser that consumes 0 or more
*/
fun <T> Parser<T>.many(): Parser<Iterable<T>> =
this.bind { initial ->
this.many().map { xs -> listOf(initial) + xs }
}
.or(pure(listOf()))

/**
* Returns a parser that consumes 1 or more
*/
fun <T> Parser<T>.atLeastOne() =
this.bind { initial ->
this.many().map { xs -> listOf(initial) + xs }
}

These are the most complicated combinators so far. Not only does many rely on bind, map, or and pure, but it also uses recursion! To understand it, it might be helpful to look at .or(pure(listOf())) first. This returns a parser that, should the first parser fail, will succeed with the empty list. Otherwise it will repeatedly run the parser for as long as it can consume input, concatenating the results into a list. Finally when it can consume no more, the result is concatenated with the empty list from pure(listOf()).

The atLeastOne() combinator is easy to understand after understanding many(), because the former relies on the latter. It is essentially the same, except it doesn’t have the fallback to an empty list until after it has consumed at least some input.

Edit: I’ve come up with a different implementation of many() and atLeastOne():

/**
* Returns a parser that consumes 0 or more
*/
fun <T> Parser<T>.many(): Parser<Iterable<T>> =
this.atLeastOne().or(pure(listOf()))
/**
* Returns a parser that consumes 1 or more
*/
fun <T> Parser<T>.atLeastOne() =
this.bind { this.many().map { xs -> listOf(it) + xs } }

This uses mutual recursion to eliminate the duplication of the previous implementation, and (in my humble opinion) makes it much clearer.

That brings us to the end of the post, as most of the other combinators and parsers follow naturally from these building blocks that we’ve already defined. It reminds me of the saying “mighty oaks from little acorns grow”! To see them when they’re grown, take a look at Parser.kt and a more advanced example.

--

--

Nick Lydon

British software developer working as a freelancer in Berlin. Mainly dotnet, but happy to try new things! https://github.com/NickLydon