2. A Basic Lexer

We begin with the simple task of writing a program that extracts all words from the standard input and prints each of them on a separate line to standard output. Here we define a word to mean a string of uppercase or lowercase letters from the English alphabet. Any non-letter character between words should be ignored.

Alex is a preprocessor. It takes as input the description of a lexical analyser and produces a Haskell program that implements the analyser. An Alex file for the task above can be found at wordcount/wordcount.x in this book’s GitHub repository. Alex input files are usually given the extension .x. Invoking Alex on this file with the command

alex wordcount.x

produces the file wordcount.hs which in turn can be compiled with

ghc --make wordcount

Cabal also knows that it can obtain a file with an .hs extension by invoking alex on a file with the same name and an .x extension. If your project contains Alex files, you must list alex among the build tools, and you must also add the array package as a dependency, since the programs produced by Alex use it.
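For example, the relevant stanza of a .cabal file might look roughly like this (the stanza is illustrative, not taken from the book’s repository; older versions of Cabal used the build-tools field instead of build-tool-depends):

```cabal
executable wordcount
  main-is:            wordcount.hs
  -- alex is needed to turn wordcount.x into wordcount.hs
  build-tool-depends: alex:alex
  -- the generated scanner uses the array package
  build-depends:      base, array
```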

2.1 Structure of an Alex file

Initial code

Our wordcount.x file is composed of a number of components. At the beginning is a code fragment, enclosed in braces, which Alex copies verbatim to its output.

{
module Main(main) where
}

This is where you put the module declaration for the generated code. This is also where you need to put any import statements that you may need for the rest of your program. Do not put any other declarations here since Alex will place its own imports after this section.

The wrapper

The next line

%wrapper "basic"

specifies the kind of interface Alex should provide to the lexical analyser it generates. The “basic” wrapper is the simplest option. In this case Alex provides a function alexScanTokens :: String -> [Token], whose argument is a string containing the entire input and whose result is a list of tokens, where Token is a user-defined type. In later chapters we will discuss other interfaces that Alex can provide, which are more flexible but also require more work from the user.

Macro definitions

Next come definitions of macros.

$letter = [a-zA-Z]
$nonletter = [~$letter\n]

A macro is a shortcut specifying a character set or a regular expression. The names of character set macros begin with $ and those of regular expression macros begin with @.

In the definition of $letter the expression [a-zA-Z] on the right-hand side is a character set. The square brackets [] are the union operator, forming the union of the character sets given within the brackets. a-z and A-Z denote character ranges representing the lowercase and uppercase letters respectively, and their union is the character set $letter.

In the definition of $nonletter the expression ~$letter denotes the complement of the character set $letter. The universal set used by Alex’s complement operator does not contain the newline character, so we add it separately with the escape sequence \n.
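To make the semantics of the two macros concrete, here is a small Haskell sketch of them as character predicates. The function names are ours, purely for illustration; Alex does not generate anything like this.

```haskell
-- $letter = [a-zA-Z]: an uppercase or lowercase English letter.
isLetterChar :: Char -> Bool
isLetterChar c = c `elem` (['a' .. 'z'] ++ ['A' .. 'Z'])

-- $nonletter = [~$letter\n]: the complement of $letter,
-- with the newline added back in explicitly.
isNonLetter :: Char -> Bool
isNonLetter c = not (isLetterChar c) || c == '\n'
```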

Rules

After the definitions we specify the rules. The beginning of the rules section is marked by the line tokens :-. The name tokens is purely decorative; Alex looks only for the :- to see where the rules begin. In our example the rules are

tokens :-
  $nonletter+     ;
  $letter+        { id }

Each rule is a regular expression followed by an action. Whenever the regular expression in a rule matches the input at the current position, Alex carries out the corresponding action. If more than one rule matches the current input, the longest match is preferred. If there are multiple longest matches, the rule which comes earlier in the Alex file is preferred.

Actions come in two forms: either a bare ; or some Haskell code. If the action is a bare ;, Alex skips the input matched by the rule. If the action is Haskell code, what Alex does with it depends on the wrapper. For the basic wrapper the code for each action must be a function of type String -> Token, which Alex calls with the matched input to produce a token.

The first rule in our example says that sequences of non-letters must be skipped. Here + is the regular expression operator which matches one or more occurrences of the preceding regular expression.

The second rule similarly matches a sequence of one or more letters. Alex’s preference for the longest match automatically ensures that it will extend a word as far as possible. The action in this case is the standard Haskell identity function id, which returns its input unchanged. This implicitly defines our token type to be String: the value of a token is simply the source text of the word.
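Putting the two rules together, the behaviour of the generated scanner on this particular grammar can be sketched by hand. The following is an illustrative equivalent, not the code Alex actually emits:

```haskell
-- A hand-written sketch with the same behaviour as the two rules:
-- drop each maximal run of non-letters, and emit each maximal run
-- of letters as a token.
scanWords :: String -> [String]
scanWords s =
  case dropWhile (not . isEnglishLetter) s of
    ""   -> []
    rest -> let (word, more) = span isEnglishLetter rest
            in word : scanWords more
  where
    isEnglishLetter c = c `elem` (['a' .. 'z'] ++ ['A' .. 'Z'])
```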

Final code segment

At the end of the Alex file is another code segment that is copied verbatim to the output file. We make use of it to define our main function.

{
main :: IO ()
main = do
  s <- getContents
  let toks = alexScanTokens s
  mapM_ putStrLn toks
}

This reads the entire standard input into the String s, which is then passed to the Alex-generated function alexScanTokens to produce a list of tokens: in our case just a list of strings, each of which is a word found in the input. The final line of main prints each of these words on a separate line.