1. Introduction
1.1 Lexers and parsers
Many programs take structured text as input. In the case of a compiler this text is a program written in a certain programming language. In the case of an analytics tool it might be a log file with entries in a particular format. In the case of a web server it may be a configuration file describing the different sites to be served. In all these cases the text first appears to the program as a simple sequence of bytes. It is the task of the program to interpret the bytes and discover the intended structure. That is the task of parsing.
Parsing itself is often divided into two phases. The first phase, lexical analysis, is carried out by a part of the program called the lexical analyser, or lexer for short. This phase breaks up the sequence of bytes provided as input into a sequence of indivisible tokens, while at the same time carrying out other tasks such as keeping track of line numbers in the source file and skipping whitespace and comments. The parser proper then takes these tokens and assembles them into larger structures according to rules given by a grammar.
Thus, given the following line of Haskell source code
module Main where -- the main module
the lexical analyser will produce the tokens module, Main and where, while skipping the whitespace and the comment. It may also mark module and where as keywords of the language. The parser then looks at this sequence of tokens and recognizes it as a statement declaring the module “Main”.
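To make this concrete, here is a toy sketch in plain Haskell of what a lexer might produce for this one line. The Token type and the lexLine function are invented purely for illustration; they are not the output of Alex, merely a hint of the kind of value a lexer hands to a parser.

```haskell
-- An invented token type for this tiny example.
data Token = TModule         -- the keyword "module"
           | TWhere          -- the keyword "where"
           | TIdent String   -- an identifier such as "Main"
  deriving (Eq, Show)

-- A naive lexer for this one line: split on whitespace, drop everything
-- from a "--" comment marker onwards, and classify each remaining word.
-- A real lexer is far more careful than this.
lexLine :: String -> [Token]
lexLine = map classify . takeWhile (/= "--") . words
  where
    classify "module" = TModule
    classify "where"  = TWhere
    classify w        = TIdent w

main :: IO ()
main = print (lexLine "module Main where -- the main module")
-- prints [TModule,TIdent "Main",TWhere]
```

Note that the whitespace and the comment have disappeared: the parser only ever sees the three tokens.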
The advantage of having a separate lexer is that lexers are easy to write, since breaking a string of bytes into tokens usually does not require an understanding of the entire input at once: boundaries between tokens can usually be found by looking at a few characters at a time. At the same time, by getting rid of extraneous data like whitespace and comments, a lexer simplifies the design and implementation of the parser, which can now deal with a more abstract representation of the input.
1.2 Alex and Happy
Alex and Happy are programs that generate other programs. Given a high-level description of the rules governing the language to be parsed, Alex produces a lexical analyser and Happy produces a parser. Alex and Happy are the intellectual descendants of the Unix programs Lex and YACC (Yet Another Compiler Compiler). But whereas the lexers and parsers generated by Lex and YACC were written in C, the programs generated by Alex and Happy are written in Haskell and can therefore easily be used as part of a larger Haskell program.
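As a small taste of what is to come, an Alex specification for the example line from the previous section might look something like the following sketch. The module name, macro definitions and token type are choices made for illustration; the notation itself is only previewed here.

```
{
module Lexer (Token (..), alexScanTokens) where
}

%wrapper "basic"

$alpha = [a-zA-Z]

tokens :-

  $white+                 ;                  -- skip whitespace
  "--".*                  ;                  -- skip line comments
  module                  { \_ -> TModule }
  where                   { \_ -> TWhere }
  $alpha [$alpha \_ \']*  { \s -> TIdent s }

{
data Token = TModule | TWhere | TIdent String
  deriving (Eq, Show)
}
```

Running Alex on such a file produces a Haskell module exporting an alexScanTokens function that turns an input string into a list of tokens.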
1.3 Installation
Alex and Happy are part of the Haskell Platform and should have been installed when you installed the Haskell Platform. They can also be installed or updated using the Cabal package manager.
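For example, assuming the cabal-install tool is available on your system, something along these lines installs both tools:

```
$ cabal update
$ cabal install alex happy
```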