Preface
The full source code for the examples can be downloaded from Github.
This is a work-in-progress. Please send your comments and bug reports to jyotirmoy@jyotirmoy.net.
Cover image
Leafy seadragon. Image by user lecates on Flickr. Used under the terms of the CC BY-SA 2.0 license. The artwork for the cover page is released under a newer version of the same license and is included in the Github repository for this book.
1. Introduction
1.1 Lexers and parsers
Many programs take structured text as input. In the case of a compiler this text is a program written in a certain programming language. In the case of an analytics tool it might be a log file with entries in a particular format. In the case of a web server it may be a configuration file describing the different sites to be served. In all these cases the text first appears to the program as a simple sequence of bytes. It is the program's job to interpret the bytes and discover the intended structure. This is the task of parsing.
Parsing itself is often divided into two phases. The first phase, lexical analysis, is carried out by a part of the program called the lexical analyser, or lexer for short. This phase breaks up the sequence of bytes provided as input into a sequence of indivisible tokens, while at the same time carrying out other tasks such as keeping track of line numbers in the source file and skipping whitespace and comments. The parser proper then takes these tokens and assembles them into larger structures according to rules given by a grammar.
Thus, given the following line of Haskell source code
module Main where -- the main module
the lexical analyser will produce the tokens module, Main and where while skipping the whitespace and the comment. It may also mark module and where as keywords of the language. The parser then looks at this sequence of tokens and recognizes it as a statement declaring the module “Main”.
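To make this concrete, here is a rough Haskell sketch of what such a token stream might look like; the Token type below is purely illustrative and not taken from any example in this book:

-- A hypothetical token type for this fragment.
data Token = TModule | TWhere | TIdent String
  deriving Show

-- The token stream a lexer might produce for our example line:
toks :: [Token]
toks = [TModule, TIdent "Main", TWhere]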
The advantage of having a separate lexer is that lexers are easy to write, since breaking a string of bytes into tokens usually does not require an understanding of the entire input at once. Boundaries between tokens can usually be found by looking at a few characters at a time. At the same time, by getting rid of extraneous data like whitespace and comments, a lexer simplifies the design and implementation of the parser, which can now deal with a more abstract representation of the input.
1.2 Alex and Happy
Alex and Happy are programs that generate other programs. Given a high-level description of the rules governing the language to be parsed, alex produces a lexical analyser and happy produces a parser. Alex and Happy are the intellectual descendants of the Unix programs Lex and YACC (Yet Another Compiler Compiler). But whereas the lexers and parsers generated by Lex and YACC were in C, the programs generated by Alex and Happy are in Haskell and can therefore easily be used as part of a larger Haskell program.
Installation
Alex and Happy are part of the Haskell Platform and should have been installed when you installed the Haskell Platform. They can also be installed or updated using the Cabal package manager.
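For example, both tools can be installed from Hackage with the command

cabal install alex happy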
2. A Basic Lexer
We begin with the simple task of writing a program that extracts all words from the standard input and prints each of them on a separate line to standard output. Here we define a word to mean a string of uppercase or lowercase letters from the English alphabet. Any non-letter character between words should be ignored.
Alex is a preprocessor. It takes as input the description of a lexical analyser and produces a Haskell program that implements the analyser. An Alex file for the task above, wordcount/wordcount.x, can be found in this book's Github repository. Alex input files are usually given the extension .x. Invoking Alex on this file with the command
alex wordcount.x
produces the file wordcount.hs, which in turn can be compiled with
ghc --make wordcount
Cabal also knows that it can get a file with an .hs extension by invoking alex on a file with the same name and an .x extension. If you have Alex files in your project you have to include alex among the build-tools, and also include the array package as a dependency, since the programs produced by alex use this package.
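As a sketch, the relevant stanza of a .cabal file for the wordcount program might look like the following; the field values here are illustrative:

executable wordcount
  main-is:        wordcount.hs
  build-depends:  base, array
  build-tools:    alex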
2.1 Structure of an Alex file
Initial code
Our wordcount.x file is composed of a number of components. At the beginning is a code fragment enclosed in braces which is copied verbatim by Alex to its output
{
module Main (main) where
}
This is where you put the module declaration for the generated code. This is also where you need to put any import statements that you may need for the rest of your program. Do not put any other declarations here since Alex will place its own imports after this section.
The wrapper
The next line
%wrapper "basic"
specifies the kind of interface Alex should provide to the lexical analyser it generates. The “basic” wrapper is the simplest option. In this case Alex will provide a function alexScanTokens :: String -> [Token] whose argument is a string containing the entire input and whose result is a list of tokens, where Token is a user-defined type. In later chapters we will discuss other interfaces that Alex can provide which are more flexible but which also require more work from the user.
Macro definitions
Next come definitions of macros.
$letter = [a-zA-Z]
$nonletter = [~$letter\n]
A macro is a shortcut specifying a set of characters or a regular expression. The names of character set macros begin with $ and those of regular expression macros begin with @.
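Our wordcount lexer uses only character set macros, but a regular expression macro is defined in the same style; for instance (a hypothetical definition, not part of wordcount.x):

@word = $letter+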
In the definition of $letter the expression [a-zA-Z] on the right-hand side is a character set. The square brackets [] are the union operator, forming the union of the list of character sets given within the brackets. a-z and A-Z denote character ranges representing the lowercase and uppercase letters respectively, whose union is the character set $letter.
In the definition of $nonletter the expression ~$letter denotes the complement of the character set $letter. The universal set used by Alex's complement operator does not contain the newline, so we specify it separately with the escape sequence \n.
Rules
After the definitions we specify the rules. The beginning of the rules section is marked by the line tokens :-. The name tokens is purely decorative; Alex looks only for the :- to see where the rules begin. In our example the rules are
tokens :-
$nonletter+ ;
$letter+ {id}
Each rule is a regular expression followed by an action. Whenever a regular expression in a rule matches the input from the current position, Alex carries out the corresponding action. If more than one rule matches the current input the longest match is preferred. If there are multiple longest matches then the rule which comes earlier in the Alex file is preferred.
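To see these disambiguation rules at work, consider a hypothetical pair of rules (not part of wordcount.x) in a lexer whose token type includes constructors TIf and TIdent:

"if"       { \s -> TIf }
$letter+   { \s -> TIdent s }

On the input iffy both rules match a prefix, but $letter+ gives the longer match (four characters against two), so the lexer produces TIdent "iffy". On the input if both matches have length two, and the rule listed earlier wins, producing TIf.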
Actions can be of two types: either a bare ; or some Haskell code. If the action is a bare ; then Alex skips the input which matches the rule. In the case of an action which is Haskell code, what Alex does depends on the wrapper. For the basic wrapper the code for each action must be a function of type String -> Token, which is called by Alex with the matched input to produce a token.
The first rule in our example says that sequences of non-letters must be skipped. Here + is the regular expression operator which matches one or more occurrences of the preceding regular expression. The second rule similarly matches a sequence of one or more letters. Alex's preference for the longest match automatically ensures that it will extend a word as far as possible. The action in this case is the standard Haskell identity function id, which returns its input unchanged. This implicitly defines our token type to be String: the value of a token is simply the source text corresponding to the word.
Final code segment
At the end of the Alex file is another code segment that is copied verbatim to the output file. We make use of it to define our main function.
{
main :: IO ()
main = do
  s <- getContents
  let toks = alexScanTokens s
  mapM_ putStrLn toks
}
This reads the entire standard input into the String s, which is then passed to the Alex-generated function alexScanTokens to produce a list of tokens, in our case just a list of strings in which each element is a word found in the input. The final line of main prints each of these words on a separate line.
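Assuming the program has been built as described earlier, a run might look like this:

$ echo "Hello, world!" | ./wordcount
Hello
world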
3. Monadic Parsers and Lexers
Lexers and parsers must often maintain state while they parse a text. One way of encoding stateful computations in Haskell is to use the State monad. We can do so using the monadic interface to Happy, which allows parser actions to be of any monadic type, not just actions in the State monad. Alex can work with monadic actions even more easily since it allows its actions to be of any type.
Finally, rather than requiring a list of tokens, Happy can provide a parser function which executes a monadic action to acquire a new token, thus easily interfacing with a monadic Alex lexer.
The code for this chapter is in the folder arith1/ in the Github repository of this book. The language we are trying to parse is a simple language of arithmetic and boolean expressions adapted from Pierce's Types and Programming Languages. There is one numeric literal 0 and two boolean literals true and false. The numeric functions succ and pred evaluate to the successor and predecessor of a number. The function iszero applied to a number evaluates to true or false depending on whether the number is 0 or not. Finally, the term if [expr1] then [expr2] else [expr3] evaluates to [expr2] if [expr1] evaluates to true and to [expr3] otherwise.
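For example, the expression if iszero 0 then succ 0 else 0 evaluates to succ 0, the representation of the number 1, since the test iszero 0 evaluates to true.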
We want to provide a Read-Eval-Print Loop (a REPL) for expressions in this language. It will not do for such a program to die when faced with a lexical or syntax error. We expect it to print an error message and recover to accept the next expression. We will achieve this in this chapter by using Haskell's MonadError class to handle errors gracefully.
3.1 The Lexer
The prologue
{
module Lexer
  ( Token(..)
  , P
  , evalP
  , lexer
  ) where

import Control.Monad.State
import Control.Monad.Error
import Data.Word
}
Alex rules
tokens :-
$white+ ;
true {TTrue}
false {TFalse}
0 {TZero}
succ {TSucc}
pred {TPred}
if {TIf}
then {TThen}
else {TElse}
iszero {TIsZero}
Haskell code
First we define the token type
{
data Token = TTrue
           | TFalse
           | TZero
           | TSucc
           | TPred
           | TIf
           | TThen
           | TElse
           | TIsZero
           | TEOF
           deriving (Eq, Show)
and the functions that we must provide for Alex's low-level interface,
type AlexInput = [Word8]

alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte (b:bs) = Just (b, bs)
alexGetByte []     = Nothing

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar = undefined
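We can get away with leaving alexInputPrevChar undefined because none of our rules use left-context (^) patterns; Alex consults the previous character only for such patterns, so this function should never actually be called here.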
Now comes our parser monad. Control.Monad.Error defines the type Either String to be an instance of the class MonadError, which allows us to throw and catch exceptions. We will use this facility to demonstrate the graceful handling of parse errors. We apply the StateT monad transformer to the base error monad to allow us to store the current input to Alex as state.
type P a = StateT AlexInput (Either String) a

evalP :: P a -> AlexInput -> Either String a
evalP = evalStateT
Next we define an action in our monad which will produce a new token.
readToken :: P Token
readToken = do
  s <- get
  case alexScan s 0 of
    AlexEOF             -> return TEOF
    AlexError _         -> throwError "!Lexical error"
    AlexSkip inp' _     -> do
      put inp'
      readToken
    AlexToken inp' _ tk -> do
      put inp'
      return tk
We have not provided any wrapper specification to Alex, so it provides us with the lowest-level interface. The function alexScan takes an AlexInput and a start code and tries to produce a token. We will discuss start codes in detail in a later chapter; for now we use the default start code of 0. alexScan can return four kinds of values.
- AlexEOF indicates the end of the input. In this case we return the token TEOF, which will indicate end of input to the parser.
- AlexError indicates an error condition. Here we make use of the throwError action provided by the MonadError class to raise an exception within our monad.
- AlexSkip is an indication that some input needs to be skipped over. inp' indicates the position in the input from which scanning should continue. We use the put action provided by StateT to make a record of this. Then we call readToken again to try and find a token.
- AlexToken indicates that a token has been found. Once again inp' is the new input position, which we save. tk is the lexer action specified in the pattern. For this lexer it is simply a value of the Token type, which we return.
Ideally the Happy parser would take a monadic action like readToken and use it to fetch tokens. Instead, for historical performance reasons, it expects a different interface, which we provide in the lexer function.
lexer :: (Token -> P a) -> P a
lexer cont = readToken >>= cont
Happy passes a continuation cont to the lexer, which represents the parsing task to be performed as a function of the next token read, and expects it to be called with the actual next token. lexer does precisely this, using readToken to fetch the token and the monadic bind operator >>= to pass it on to the provided continuation. We could also have written it as
lexer cont = do
  token <- readToken
  cont token
3.2 The Abstract Syntax Tree Type
The AST module provides an abstract syntax tree type for our language. It will be the job of the parser to build values of this type.
module AST where

data Term = STrue
          | SFalse
          | SZero
          | SIsZero Term
          | SSucc Term
          | SPred Term
          | SIfThen Term Term Term
          deriving Show
3.3 The Parser
The prologue
This is just the module declaration and imports.
{
module Parser (parse) where

import AST
import qualified Lexer as L
import Control.Monad.Error
}
Defining the types of lexer and parser
It is in this section that we tell Happy that we need a monadic parser and have a monadic lexer.
%monad{L.P}
%lexer{L.lexer}{L.TEOF}
%name parse
%tokentype{L.Token}
%error {parseError}
The declaration %monad{L.P} says that we need a monadic parser that operates in the L.P monad. The declaration %lexer{L.lexer}{L.TEOF} specifies that we have a monadic lexer L.lexer and that L.TEOF is the token pattern that denotes end of input.
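With these declarations in place, the parse function generated by the %name directive is itself a monadic action, of type L.P Term in our case, which fetches its tokens by calling L.lexer.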
Defining tokens
%token
true {L.TTrue}
false {L.TFalse}
zero {L.TZero}
iszero {L.TIsZero}
succ {L.TSucc}
pred {L.TPred}
if {L.TIf}
then {L.TThen}
else {L.TElse}
Grammar rules
%%
Term : true {STrue}
| false {SFalse}
| zero {SZero}
| iszero Term {SIsZero $2}
| succ Term {SSucc $2}
| pred Term {SPred $2}
| if Term then Term else Term {SIfThen $2 $4 $6}
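For instance, the input if iszero 0 then 0 else succ 0 is parsed into the value SIfThen (SIsZero SZero) SZero (SSucc SZero).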
The Error Handler
The error handling function parseError is now of type L.Token -> L.P a. Happy requires that it be polymorphic in the type a. In our case it again uses the throwError action of MonadError to raise an exception within the monad.
{
parseError _ = throwError "!Parse Error"
}
3.4 The Driver
The driver code in Main.hs and Evaluator.hs provides a primitive REPL, taking one expression per line and printing the result of evaluating it. The interesting part as far as Alex and Happy are concerned is the following two lines in the function myREPL in Main.hs
case L.evalP P.parse (encode s) of
  Right t -> putStrLn (output t)
  Left s  -> putStrLn s
Because we are using Either String as an error monad, a lexing or parsing error which leads to our calling throwError results in L.evalP returning a Left value (here encode converts the input String into the [Word8] that our AlexInput type expects). In our case we just print the error message and continue on our way to read the next line. This is much better than the alternative in earlier chapters of calling error and dying.