Parsing may be a challenging task. And it’s amazing how Scala Parser Combinators make it doable for regular human beings. You can easily describe how to parse simple things like identifier, string constant or argument list, and then combine these simple parsers into more comples ones. I’d like to describe here step-by-step how to get that brilliant library working for you. No deep theory or specific features, just bare minimum to get started with typical tasks. We’re going to develop boolean expression parser and use it to evaluate some expressions. The complete project is available here:

Include the library in your build

Add "org.scalatest" %% "scalatest" % "2.2.1" % "test" to libraryDependencies in your build.sbt file

Create your model (abstract syntax tree)

Model doesn’t have any dependencies on library classes, you can write here anything that makes sense for your domain - you’ll be instantiating these classes later by hands. Here what I’ve got at this stage:

package parsers

object Ast {
sealed trait Expression
case class And(expressions: Seq[Expression]) extends Expression
case class Or(expressions: Seq[Expression]) extends Expression
case class Not(expression: Expression) extends Exp
case class Variable(name: String) extends Expression

Whenever there’s a need to use these classes, following import simplifies the job: import parsers.Ast._

Create a place for your parsers

Typically you’d like to extend from scala.util.parsing.combinator.JavaTokenParsers or scala.util.parsing.combinator.RegexParsers. They define all the necessary DSL for creating the parsers. JavaTokenParser extends RegexParsers with some sugar like definition of what is identifier or string constant in Java world - usually it’s quite useful. Here is my approach:

package parsers
import scala.util.parsing.combinator.JavaTokenParsers

object BooleanExpressionParser extends JavaTokenParsers {
import Ast._

def parse(str: String): Ast.Expression =
parseAll(expression, str) match {
case Success(result, _) => result
case failedOrIncomplete => throw new RuntimeException(failedOrIncomplete.toString)

private def expression: Parser[Expression] = ??? //will define it later
//... here go all your parsers

The entry point is private dev expression - it defines how to get the root of your AST from the string. Returning type for parsers (like Parser[Expression]) is optional, Scala almost always correctly detects it. However, here I’ll keep fhe types for the sake of clarify.

Define first parser

Let’s parse variable names. For the beginning we’ll follow Java approach, so x, X and someVar are valid names, while 123var not.
Add following parser to BooleanExpressionParser:

private def variable: Parser[Variable] = ident  ^^ (Variable(_))

and change expression to this:

private def expression: Parser[Expression] = variable

Here ident is a regex-based parser (it’s type is Parser[String]) from JavaTokenParsers which checks if a string is a valid identifier and if is - returns the string. ^^ is a combinator defined in scala.util.parsing.combinator.RegexParsers for transforming parsing result. Here ident returns a String, but we cant a Variable. If target class doesn’t need additional argument processing and has only one argument then simples approach like (Variable(_)) may be used.

Following check is satisfied now:

BooleanExpressionParser.parse("x") mustBe(Variable("x"))

Now suppose we need some processing of the string before creating target object. For example, we’d like variables to be case-insensitive. Then variablecan be changed this way:

private def variable: Parser[Variable] = ident  ^^ {case v => Variable(v.toLowerCase)}

This check is satisfied as well now:

BooleanExpressionParser.parse("X") mustBe(Variable("x"))

Parser that checks the prefix/suffix

Add this one:

private def not: Parser[Not] = "not" ~> expression ^^ (Not(_))

and change expression to this:

private def expression: Parser[Expression] = not | variable

Few points to note here:

  • | defines alternatives - parsers from the list are tried in order, and first successful is applied
  • "not" is also a parser here. Matches string “not” and nothing else. If matched - returns that string
  • ~> is a combinator telling that we’re not interested in result of parser on the left side - we just want to know that it matches the input. So result of the left part is discarded
  • There’s a counterpart <~ discarding the right side. You’ll see it later in the code

Parse repetitive fragment

Now let’s implement logical operation and, so that the following check would be satisfied:

BooleanExpressionParser.parse("x and y and z") mustBe(And(Seq(Variable("x"), Variable("y"), Variable("z"))))

Define following 2 parsers (here type qualifier in pattern matching are also optional - they’re here just to make the types more clear):

private def leftExpression: Parser[Expression] = not | variable
private def and: Parser[And] = leftExpression ~ ("and" ~> leftExpression).+ ^^
{case (left: Expression) ~ (right: Seq[Expression]) => And(left +: right)}

And change expression to: and | not | variable

Key points here are:

  • () - brackets wrap a part of parser into kind of “sub-parser”
  • something.+ - quantifier, requires underlying parser to match at least once. Returns Seq[] of what underlying parsers return
  • Other quantifiers are also available. To name a few:
    • something.* or rep(something)- match zero or more times,
    • repsep(something, separatorChar) - match something zero or more times, separated by separatorChar
    • something.? - match something zero or 1 time, returns Option
  • something1 ~ something2 - combines a parsers into a chain, every parser must match successfully and every parser result is preserved. Returns a special data structure that can be deconstructed with the same character ~ in pattern matching part.
  • leftExpression has to be introduced. If simply use expression as underlying element of and, then we end up with undefined recursion and StackOverflowException. and expression can’t start at the first character of another and expression - they have to be separated by at least one level of some sub-expressions. We’re going to use brackets later to resolve situation when there are nested expressions which require recursive parser: and -> or -> and

Parser with constant output

Sometimes you’d like to produce the same output for all inputs that match certain criteria. As an example - let’s consider everything that starts with /* and ends with */ as a comment and convert it to an empty string:

private def comment: Parser[String] = """/\*([^*]|\*[^/])*\*/""".r ^^^ ""

Here ^^^ completely discards the result of parsing, just makes sure that parsing was successful. Result is always takebn from the right side - an empty string in this case. You’ll see later how it’s integrated into the whole picture.

The whole picture

Now you have enough knowlege of parser combinators to understand the remaining parts. So here’s the final code of all parsers:

private def expression: Parser[Expression] = combinationExpression | leftExpression

private def combinationExpression: Parser[Expression] = comment.? ~> or | and <~ comment.?

private def leftExpression: Parser[Expression] = comment.? ~> not | brackets | variable <~ comment.?

private def brackets: Parser[Expression] = "(" ~> expression <~ ")"

private def variable: Parser[Variable] = ident ^^ {case v => Variable(v.toLowerCase)}

private def not: Parser[Not] = "not" ~> expression ^^ (Not(_))

private def and: Parser[And] = leftExpression ~ ("and" ~> leftExpression).+ ^^
{case (left: Expression) ~ (right: Seq[Expression]) => And(left +: right)}

private def or: Parser[Or] = (and | leftExpression) ~ ("or" ~> (and | leftExpression)).+ ^^ {case left ~ right => Or(left +: right)}

Evaluation of the AST is out of scope of this article, you can see the code on GitHub.

What’s left behind

Some important aspects didn’t find a place in the demo project

Case-insensitive string matching

If you want “true”, “True” and “TRUE” to be regarded as the same value, you need to use regular expressions. You may write your wrapper like this:

private def r(str: String) = ("(?i)" + str).r

And then use it like this:

private def boolConst: Parser[String] = r("true") | r("false")

Exclude reserved words from matching

You may use guard to check the input, but don’t consume it, and then not for making a parser fail if sub-parser succeeds:

private def reservedWords: Parser[String] = guard(“if” | “case” | “class” | “object” | “trait”)
private def notReservedIdentifier: Parser[String] = not(reservedWords) ~> ident

Hope it will make a start with parser combinators easier for you!