Leanpub: Publish Early, Publish Often

3. Monadic Parsers and Lexers

Lexers and parsers must often maintain state while they parse a text. One way of encoding stateful computations in Haskell is to use the State monad. We can do so using the monadic interface to Happy which allows parser actions to be of any monadic type, not just actions in the State monad. Alex can work with monadic actions even more easily since it allows its actions to be of any type. Finally, rather than requiring a list of tokens, Happy can provide a parser function which executes a monadic action to acquire a new token, thus easily interfacing with a monadic Alex parser.

The code for this chapter is in the folder arith1/ in the Github repository of this book. The language we are trying to parse is a simple language of arithmetic and boolean expressions adapted for Pierce’s Types and Programming Languages. There is one numeric literal 0 and two boolean literals true and false. The numeric functions succ and pred evaluate to the successor and predecessor of a number. The function iszero applied to a number evaluates to true or false depending on whether a number is 0 or not. Finally, the term if [expr1] then [expr2] else [expr3] evaluates to [expr2] if [expr1] evaluates to true and to [expr3] otherwise.

We want to provide a Read-Eval-Print Loop (a REPL) for expressions in this language. It will not do for such a program to die when faced with a lexical and syntax error. We expect it to print an error message and recover to accept the next expression. We will achieve this in this chapter by using Haskell’s MonadError class to handle errors gracefully.

3.1 The Lexer

The prologue

{
module Lexer (Token(..),P,evalP,lexer) where
import Control.Monad.State
import Control.Monad.Error
import Data.Word
}

Alex rules

tokens :-
    $white+      ;
    true			{TTrue}
    false		{TFalse}
    0			{TZero}
   succ			{TSucc}
   pred			{TPred}
   if			{TIf}
   then			{TThen}
   else			{TElse}
   iszero		{TIsZero}

Haskell code

First we define the token types.

{
data Token = 
     TTrue
     | TFalse
     | TZero
     | TSucc
     | TPred
     | TIf
     | TThen
     | TElse
     | TIsZero
     | TEOF
     deriving (Eq,Show)

and the functions that must be provided to Alex’s basic interface,

type AlexInput = [Word8]
alexGetByte :: AlexInput -> Maybe (Word8,AlexInput)
alexGetByte (b:bs) = Just (b,bs)
alexGetByte []    = Nothing

alexInputPrevChar :: AlexInput -> Char
alexInputPrevChar = undefined

Now comes our parser monad. Control.Monad.Error defines the type Either String to be an instance of the class MonadError which allows us to throw and catch exceptions. We will use this facility to demonstrate the graceful handling of parse errors. We apply the StateT monad transformer to the base error monad to allow us to store the current input to alex as state.

type P a = StateT AlexInput (Either String) a

evalP::P a -> AlexInput -> Either String a
evalP = evalStateT

Next we define an action in our monad which will produce a new token.

readToken::P Token
readToken = do
	  s <- get  
	  case alexScan s 0 of
      	        AlexEOF -> return TEOF
		AlexError _ -> throwError "!Lexical error"
	   	AlexSkip inp' _ -> do	
			  put inp'
			  readToken
	   	AlexToken inp' _ tk -> do 
			  put inp'
			  return tk

We have not provided any wrapper specification to Alex, so it provides us with the lowest level interface. The function alexScan takes an AlexInput and start code and tries to produce a token. We will discuss start codes in detail in a later chapter. For now we use the default startcode of 0. alexScan can return four kinds of values.

AlexEOF indicates the end of output. In this case we return the token TEOF which will indicate end of input to the parser.
AlexError an error condition. Here we make use of the throwError action provided by the MonadError class to raise an exception within our monad.
AlexSkip is an indication an indication that some input needs to be skipped over. inp' indicates the position in the input from which scanning should continue. We use the put action provided by StateT to make a record of this. Then we call readToken again to try and find a token.
AlexToken indicates that a token has been fount. Once again inp' is the new input position which we save. tk is the lexer action specified in the patter. For this lexer it is simply a value of the Token type which we return.

Ideally the Happy parser would take a monadic action like readToken and use it to fetch token. Instead, for historical performance reasons, it expects a different interface that we provide in the lexer function.

lexer::(Token -> P a)->P a
lexer cont = readToken >>= cont    

Happy passes a continuation cont to the lexer, which represents the parsing task to be performed as a function of the next token read and expects it to be called with the actual next token. lexer does precisely this, using readToken to fetch the token and using the monadic bind operator >>= to pass it on to the provided continuation. We could have also written it as

lexer cont = do
  token <- readToken
  cont token

3.2 The Abstract Syntax Tree Type

The AST module provides an abstract syntax tree type for our language. It will be the job of the parser to build values of this type.

module AST where

data Term =
     STrue
     | SFalse
     | SZero
     | SIsZero Term
     | SSucc Term
     | SPred Term
     | SIfThen Term Term Term
     deriving Show

3.3 The Parser

The prologue

This is just the module declaration and imports.

{
module Parser(parse) where
import AST
import qualified Lexer as L
import Control.Monad.Error
}

Defining the types of lexer and parser

It is in this section that we tell Happy that we need a monadic parser and have a monadic lexer.

%monad{L.P}
%lexer{L.lexer}{L.TEOF}
%name parse
%tokentype{L.Token}
%error {parseError}

The declaration %monad{L.P} says that we need a monadic parser that operates in the L.P monad. The declaration %lexer{L.lexer}{L.TEOF} specifies that we have a monadic lexer L.lexer and L.TEOF is the token pattern that denotes end of input.

Defining tokens

%token
true  	{L.TTrue}
false 	{L.TFalse}
zero   	{L.TZero}
iszero        {L.TIsZero}
succ		{L.TSucc}
pred		{L.TPred}
if		{L.TIf}
then		{L.TThen}
else		{L.TElse}

Grammar rules

%%
 
Term	:  true				{STrue}
|  false			{SFalse}
|  zero				{SZero}
|  iszero Term			{SIsZero $2}
|  succ Term			{SSucc $2}
|  pred Term			{SPred $2}
|  if Term then Term else Term	{SIfThen $2 $4 $6}

The Error Handler

The error handling function parseError is now of the type L.Token -> L.P a. Happy requires that it be polymorhpic in type a. In our case it again uses the throwError action of MonadError to raise an exception within the monad.

{
parseError _ = throwError "!Parse Error"

}

3.4 The Driver

The driver code in Main.hs and Evaluator.hs provide a primitive REPL, taking one expression per line and printing the result of evaluating it. The interesting part as far as Alex and Happy are concerned are the following two lines in the function myREPL in Main.hs

          case L.evalP P.parse (encode s) of
            Right t -> putStrLn (output t)
            Left s -> putStrLn s

Because we are using Either String as an error monad, a lexing or parsing error which leads to our calling throwError results in L.evalP returning a Left value. In our case we just print in and continue on our way to read the next line. This is much better than the alternatives in earlier chapters of calling error and dying.