Parsing may be a challenging task. And it’s amazing how Scala Parser Combinators make it doable for regular human beings. You can easily describe how to parse simple things like identifier, string constant or argument list, and then combine these simple parsers into more comples ones. I’d like to describe here step-by-step how to get that brilliant library working for you. No deep theory or specific features, just bare minimum to get started with typical tasks. We’re going to develop boolean expression parser and use it to evaluate some expressions. The complete project is available here: https://github.com/paul-lysak/ParserCombinatorsDemo
Include the library in your build
Add "org.scalatest" %% "scalatest" % "2.2.1" % "test"
to libraryDependencies in your build.sbt
file
Create your model (abstract syntax tree)
Model doesn’t have any dependencies on library classes, you can write here anything that makes sense for your domain - you’ll be instantiating these classes later by hands. Here what I’ve got at this stage:
package parsers
object Ast {
sealed trait Expression
case class And(expressions: Seq[Expression]) extends Expression
case class Or(expressions: Seq[Expression]) extends Expression
case class Not(expression: Expression) extends Exp
ression
case class Variable(name: String) extends Expression
}
Whenever there’s a need to use these classes, following import simplifies the job: import parsers.Ast._
Create a place for your parsers
Typically you’d like to extend from scala.util.parsing.combinator.JavaTokenParsers
or scala.util.parsing.combinator.RegexParsers
. They define all the necessary DSL for creating the parsers. JavaTokenParser
extends RegexParsers
with some sugar like definition of what is identifier or string constant in Java world - usually it’s quite useful. Here is my approach:
package parsers
import scala.util.parsing.combinator.JavaTokenParsers
object BooleanExpressionParser extends JavaTokenParsers {
import Ast._
def parse(str: String): Ast.Expression =
parseAll(expression, str) match {
case Success(result, _) => result
case failedOrIncomplete => throw new RuntimeException(failedOrIncomplete.toString)
}
private def expression: Parser[Expression] = ??? //will define it later
//...
//... here go all your parsers
//...
}
The entry point is private dev expression
- it defines how to get the root of your AST from the string. Returning type for parsers (like Parser[Expression]
) is optional, Scala almost always correctly detects it. However, here I’ll keep fhe types for the sake of clarify.
Define first parser
Let’s parse variable names. For the beginning we’ll follow Java approach, so x
, X
and someVar
are valid names, while 123var
not.
Add following parser to BooleanExpressionParser
:
private def variable: Parser[Variable] = ident ^^ (Variable(_))
and change expression
to this:
private def expression: Parser[Expression] = variable
Here ident
is a regex-based parser (it’s type is Parser[String]
) from JavaTokenParsers
which checks if a string is a valid identifier and if is - returns the string. ^^
is a combinator defined in scala.util.parsing.combinator.RegexParsers
for transforming parsing result. Here ident
returns a String
, but we cant a Variable
. If target class doesn’t need additional argument processing and has only one argument then simples approach like (Variable(_))
may be used.
Following check is satisfied now:
BooleanExpressionParser.parse("x") mustBe(Variable("x"))
Now suppose we need some processing of the string before creating target object. For example, we’d like variables to be case-insensitive. Then variable
can be changed this way:
private def variable: Parser[Variable] = ident ^^ {case v => Variable(v.toLowerCase)}
This check is satisfied as well now:
BooleanExpressionParser.parse("X") mustBe(Variable("x"))
Parser that checks the prefix/suffix
Add this one:
private def not: Parser[Not] = "not" ~> expression ^^ (Not(_))
and change expression
to this:
private def expression: Parser[Expression] = not | variable
Few points to note here:
|
defines alternatives - parsers from the list are tried in order, and first successful is applied"not"
is also a parser here. Matches string “not” and nothing else. If matched - returns that string~>
is a combinator telling that we’re not interested in result of parser on the left side - we just want to know that it matches the input. So result of the left part is discarded- There’s a counterpart
<~
discarding the right side. You’ll see it later in the code
Parse repetitive fragment
Now let’s implement logical operation and
, so that the following check would be satisfied:
BooleanExpressionParser.parse("x and y and z") mustBe(And(Seq(Variable("x"), Variable("y"), Variable("z"))))
Define following 2 parsers (here type qualifier in pattern matching are also optional - they’re here just to make the types more clear):
private def leftExpression: Parser[Expression] = not | variable
private def and: Parser[And] = leftExpression ~ ("and" ~> leftExpression).+ ^^
{case (left: Expression) ~ (right: Seq[Expression]) => And(left +: right)}
And change expression
to: and | not | variable
Key points here are:
()
- brackets wrap a part of parser into kind of “sub-parser”something.+
- quantifier, requires underlying parser to match at least once. ReturnsSeq[]
of what underlying parsers return- Other quantifiers are also available. To name a few:
something.*
orrep(something)
- match zero or more times,repsep(something, separatorChar)
- matchsomething
zero or more times, separated byseparatorChar
something.?
- matchsomething
zero or 1 time, returnsOption
something1 ~ something2
- combines a parsers into a chain, every parser must match successfully and every parser result is preserved. Returns a special data structure that can be deconstructed with the same character~
in pattern matching part.leftExpression
has to be introduced. If simply useexpression
as underlying element ofand
, then we end up with undefined recursion andStackOverflowException
.and
expression can’t start at the first character of anotherand
expression - they have to be separated by at least one level of some sub-expressions. We’re going to use brackets later to resolve situation when there are nested expressions which require recursive parser:and
->or
->and
Parser with constant output
Sometimes you’d like to produce the same output for all inputs that match certain criteria. As an example - let’s consider everything that starts with /*
and ends with */
as a comment and convert it to an empty string:
private def comment: Parser[String] = """/\*([^*]|\*[^/])*\*/""".r ^^^ ""
Here ^^^
completely discards the result of parsing, just makes sure that parsing was successful. Result is always takebn from the right side - an empty string in this case. You’ll see later how it’s integrated into the whole picture.
The whole picture
Now you have enough knowlege of parser combinators to understand the remaining parts. So here’s the final code of all parsers:
private def expression: Parser[Expression] = combinationExpression | leftExpression
private def combinationExpression: Parser[Expression] = comment.? ~> or | and <~ comment.?
private def leftExpression: Parser[Expression] = comment.? ~> not | brackets | variable <~ comment.?
private def brackets: Parser[Expression] = "(" ~> expression <~ ")"
private def variable: Parser[Variable] = ident ^^ {case v => Variable(v.toLowerCase)}
private def not: Parser[Not] = "not" ~> expression ^^ (Not(_))
private def and: Parser[And] = leftExpression ~ ("and" ~> leftExpression).+ ^^
{case (left: Expression) ~ (right: Seq[Expression]) => And(left +: right)}
private def or: Parser[Or] = (and | leftExpression) ~ ("or" ~> (and | leftExpression)).+ ^^ {case left ~ right => Or(left +: right)}
Evaluation of the AST is out of scope of this article, you can see the code on GitHub.
What’s left behind
Some important aspects didn’t find a place in the demo project
Case-insensitive string matching
If you want “true”, “True” and “TRUE” to be regarded as the same value, you need to use regular expressions. You may write your wrapper like this:
private def r(str: String) = ("(?i)" + str).r
And then use it like this:
private def boolConst: Parser[String] = r("true") | r("false")
Exclude reserved words from matching
You may use guard
to check the input, but don’t consume it, and then not
for making a parser fail if sub-parser succeeds:
private def reservedWords: Parser[String] = guard(“if” | “case” | “class” | “object” | “trait”)
private def notReservedIdentifier: Parser[String] = not(reservedWords) ~> ident
Hope it will make a start with parser combinators easier for you!