Practical Interpreter Construction
Practical Interpreter Construction
Mehmet Emin Coşkun
Buy on Leanpub

Chapter 1 - Motivation

Interpreters Are Everywhere

  • Ever thought of how your mobile phone gets the number from an SMS or WhatApp message? It parse the text and get the number from. Similar to this, smilies work the same way: an interpreter gets the text, parses it and interprets that there is a smiley in the text and then shows you an image of that smiley
  • Do you use a spreadsheet software like Excel? If so, you do some calculations. It calculates the cell values first by parsing and then interpreting that source according to math rules and according to Excel functions
  • If you use any sort of command line, like Linux’s Bash shell or Mac OS’s terminal, you know it takes the commands first and then understands them, then finally executes them. The command line does that by interpreting the commands. Every operating system shell is an interpreter. Some of them are capable of executing scripts, such as bash, tcsh, zsh…
  • I am writing this book using Markdown markup language. To produce the final html I use one of the Markdown interpreters. It takes the text written using Markdown syntax and produces the html output that I can use for this book’s web site
  • Almost every application has a configuration file. You can change the systems’ behaviour by changing that file. The system takes the configuration file, processes it according to commands in it. This component in the system is an interpreter.
  • The web browser you use every single day interpretes HTML, JavaScript and CSS to present you beautiful web pages.
  • A server that exchanges messages has a fixed structure. You need to parse messages and take actions on the server side according to the message payload.
  • Pattern recognition software that highlights plagerism must also use an interpreter to highlight problems.
  • All template engines are interpreters. If you need to modify and write a new template engine for any purpose either for web development or any other environment, you will need an interpreter

Apart from above examples, languages like Python, PHP, Ruby, JavaScript, CSS, HTML, etc. all include a parser and interpreter that can be use more effectively once you learn how parsers and interpreters work. With hardware getting cheaper and cheaper everyday, interpreters are becomming more popular as they allow developers to use more abstract languages that are easier to produce complex programs in. Like every abstraction in the computer world, interpreters help us to get the problem at a more understandable level with interpreted languages.

I think you have got better understanding about interpreters now, and more importantly, you know the benefits of learning how interpreters work.

Is It That Difficult to Write an Interpreter?

The answer is: No.

It is not difficult to write your own interpreter and your own programming language.

So, fear not if you have studied this area before but found yourself lost in the plethora of daunting and complex theory.

If you decide to study this area, then you will notice there is a lot of theory in it. Like anything and everything, there are more ways than one to get to a goal. So, it increases the complexity by adding the dilemma of too many choice to this theory heavy subject.

Books in this area are generally compiler-centered books. Of course most of them tells you how to write an interpreter and include certain chapters regarding different components of interpreters and compilers. There is no problem with this approach, but the fact is, it makes the books longer. Let’s all admit it, the longer, the less you’d like to read it.

Life is always busy and you have a life apart from computers. I have kept that in mind while writing this book and focused on simple and concise explainations.

Furthermore, many of these books are filled with dense complicated calculus with little explaination. They assume you’ve had prior exposure to the theory before handling the text, and fail to explain many core concepts.

But a lot of programmers learn not by reading, they learn by doing. The most important and valuable learning tool for them is reading well written software source code.

I am the same. I learn by doing, by reading interesting code. So, I am not into reading theory centered, long and difficult to understand books.

This book is different. I learned by writing my own interpreter from the ground up. This knowledge has let me cut out the trouble and confusion that I had to contend with.

This book is more than an interpreter book. It is a motivational book that shows you that things do not have to be complicated to study, learn and apply.

How We Will Make it Easy

  • I am a pragmatic programmer and I am into practical things. Theory is important of course, nobody can neglect it. Lets be very aware of that theory and practice are intertwined, we shouldn’t break them up.
    But let’s accept that we have deadlines, we have families, we have hobbies, things to do, and most importantly, a life. For writing an interpreter or a compiler, I don’t think that we have to read long books or to study every single detail of them. So, I am going to show you how to write a working interpreter by writing real code. We will learn by doing along with this book using concrete examples. Plus, there are a lot of programmers who are self-taught or who have been away from computer science theory for long time. We need to use a language that they can understand. Programming is an intuitive thing that one must practice it understand. So, let’s become less formal and more practical.
  • Anything can be improved but we have to be aware of that there is no end to editing. So, if we keep changing things to improve then we can not reach a reasonable end.
    True artists know when and where to stop by identifying a goal or target. We are going to write an interpreter that is working and useful, but we are not going to labor endlessly to make it perfect. We need to start somewhere to reach our goals. If we let ourselves dwell on the size and complexity of compilers and interpreters, we’ll never get started actually making one. My book makes you start, learning enough theory and practice to write your own interpreter. Besides, you will learn to imrprove your arsenal of skills in terms of tokenizers, parsers, interpreters and compilers. We are not going to optimize and focus on beautifying our code for style sake.
  • Compilers and interpreters are such a deep topic. So, it is easy to get lost if you do not define the topic set that you will study. I made this book with the intent of focusing on interpreters since I want you to learn how to build tokenizers, parsers and interpreters. I consciously did not incorporate compilers to keep this book short, and the content concise. Those topics I include in this book are essential and when you learn them, you will be in a good shape to continue into more elaborate topics.
  • I will show you working examples in every single chapter. So, you don’t have to wait to comprehend through example or to produce working code. Getting instant results is important to staying motivated.
  • The method that I will show you to parse the source code is referred to as recursive descent parsing.
    Recursive descent parsing is the easiest and most intuitive method of parsing. One can claim that there are more effective methods than recursive descent parsing but don’t rush to believe it is so. There are many real world example compilers and interpreters that are built using recursive descent parsing and they work quite effectively.
    One of the most famous example is the GNU Compiler Collection that was written from scratch using recursive descent parsing tecnique. You can find the details about it here: GCC 3.4 Release Series, Changes, New Features, and Fixes Check out the section that starts with “A hand-written recursive-descent”. Plus, every technique has its own advantages according to place that it is used. Our technique -recursive descent parsing- is elegant enough for us.
  • I am using Java since most modern programmers are familiar with it or a similar syntax and structure. Most programmers are able to be get used to it quickly too. I consciously have not used C as it can be very difficult to read programs written in C. Plus, Java has features that will save us writing some bare minimums and memory management early on. You will see Lisp derivative languages in many places when you start to study compilers. Some programmers find it easy to parse and interpret things using list processing. I may agree with them but there is a fact that, most programmers do not prefer using Lisp. Many of them never programmed in Lisp. It may be Common Lisp, Scheme or any similar in the functional programming world.
    We had interpreters and compilers without functional languages, like Lisp, and we will have them in the future.
    Let’s be realistic, you can build working and smart interpreters in C, JAVA, C#, Ruby, Perl or in similar languages that are “Turing Complete”.
  • I am not concerned with using object oriented programming techniques for this book, as they would add complications and cloud the clarity of the lesson. As Joel Spolsky says, “Don’t Let Architecture Astronauts Scare You” Object oriented programming is a useful paradigm, but it is not a must-have for every project. Plus, imperative techniques are easier to understand and create demonstrations in.
    Of course, I use classes and objects, but I am not over extending their use or forcing inheritance and polymorphism. Let’s remember, we are here to learn, not to develop perfectly to style constraints.
  • The grammar of the programming language that I will write an interpreter for in this book is fixed. We’re not trying to parse every possible grammar for every possible language. It will save us from digressing and also it will help focus us. Some may say that there is no point in an interpreter for a single grammar, but I pose that learning, parsing, and interpreting one language readies you to do so with more. With recursive descent parsing, you can write interpreters for many sorts of grammars. Once we’ve completed this interpreter, you will be able to extend it for more languages and grammars.
  • I have written this book with the language I would use if I were to be there teaching you these lessons in person.
  • We need all type of programmers. It is an excellent reason for learning parsing, interpreters, and compilers. Nobody has to be an academic or an expert in this field. Even if you were to never write a compiler or interpreter as part of production code, it will help you become a better programmer through better understanding. Big or small, better or best, any program, script or application that improve daily life is useful. We are not seeking perfection, we are learning.
  • Finally, I am not including any compiler tools in this book. There are some books that tell you how to construct a compiler and interpreter using yacc, lex, antlr, llvm or any one of many compiler construction kits. Try to search for how to construct an interpreter or ask on forums like Stackoverflow and the like. All you can get is advise that you should use this tool, that tool, this compiler kit, that compiler library. Instead of showing you full working examples, people tend to point out deeper things; far from practical advice. Besides, one should learn essentials first. Without essentials, the abstractions that tools provide you hide the important aspects that are worth your time to know. Plus, recursive descent parser is most intuitive technique to implement among other parsing techniques. So, you will craft your interpreter without having to learn any other extra tools.

We are clear now. Let’s get started with a simple interpreter: A Calculator.

Chapter 2 - Parsing and Calculating Expressions

Parsing Arithmetic Expressions

The Magic in The Calculator

We all use calculators. they are in our computers, mobile phones, software that we use or a handy one in our desk that some of use use every single day. Have you ever noticed that some calculators are a bit different than others? that they can understand and calculate things like this:

(3+7*8)-1

Normally, in an ordinary calculator you need to enter first 7*8 and then add 3 to it and then subtract 1 from that. I know some of them some memory functions that you can save some interim results but you need to enter the expression anyway.

For the above expression, the difference is that you can enter the entire expression at once, then get the calculation result. Scientific calculators can handle full expressions at a time. Plus, some of them compute expressions with variables like:

x = a * b + 2

In this expression, the first thing to calculate is multiplication, then addition, and finally subtraction.

We call it “operator precedence.” It is a very important concept that can be seen everywhere in arithmetic expressions. While simple calculators do arithmatic the way I explained above, the scientific calculators take the above expression in whole, then give you the result directly so that you don’t need to remember any operator precedence or interim values.

It is because those smart calculators are able to interpret what you enter. Congratulations; you have met one of the most used interpreters.

In this chapter, I am going to teach you how to program an intelligent calculator like those ones. With our calculator, we will enter any expresssion and it will give us the result directly.

Our method is iterative. That means we will start with simple steps and we will advance gradually. Let’s start with addition and subtraction and then add multiplication and division.

To program this, here is what we need:

  • A method for getting (reading) numbers from an input string
  • A method that recognizes the operator and calculates the result according to that operator
  • A method presents the result

We are going to use one string, one integer and and one character as global variables. These are for respectively:

  • expression (String): The string that contains our expression
  • currentCharPosition (int): The index number of the character that we are handling from expression string
  • Look (char): The character variable that holds the character that we read from expression, one by one

Let’s write our class and first method. We will also write a main method to demonstrate it. Our first class is for reading a character from experssion string.

Here we are:

class Calculator
{    
    public static String expression = "";   
    public static int currentCharPosition = 0;    
    public static char Look;
    
    public static void GetChar()
    {
        if (currentCharPosition < expression.length())
            Look = expression.charAt(currentCharPosition);
            
        currentCharPosition++;
    }        
    
    public static void main(String args[])
    {
        expression = "1";
        GetChar();
        System.out.println(Look);
    }
}

Lets compile and run our first program. Here is the result:

1

It works. We have got a working program that reads a single number from our expression.

I want to explain the only method in the class: GetChar. It reads a single character from the string that we then put it to Look global variable. Next, it increments the currentCharPosition variable. So, as we read more chars using GetChar method, we will be able to know whether or not there are more characters to read by comparing the currentCharPosition with the expression string’s length.

Now, lets put an operator into our expression string and read one more character from it. Here is our new main method:

    public static void main(String args[])
    {
        expression = "1+";
        GetChar();
        System.out.println(Look);
        GetChar();
        System.out.println(Look);
    }

Let’s compile and execute the program.

1
+

Our program is able to read characters from the expression string one by one. Let’s add one more number to the expression string and compile & run.

Here is the new main method:

    public static void main(String args[])
    {
        expression = "1+2";
        GetChar();
        System.out.println(Look);
        GetChar();
        System.out.println(Look);
        GetChar();
        System.out.println(Look);
    }
1
+
2

So far so good. We are able to read the characters in the expression. Now, it’s time to convert them something meaningful. Let’s write a method that gets a digit from the expression and make a number from it. Here is our GetNum method:

    public static int GetNum()
    {
        int number = new Integer(Look+"").intValue();
        GetChar();
        
        return number; 
    }

Let’s add it to Calculator class.

All it does is taking the character that we stored into global Look variable and converting it to an integer. Note that it calls GetChar to read next character in the expression string and put it into Look variable.

Before using it, we need an initializer method that reads first digit in the expression string and put it to Look variable. Otherwise GetNum method gets a null value since there is nothing in the Look variable in the beginning. Here is our Init method, all it does is call GetChar:

    public static void Init()
    {        
        GetChar();
    }

And here is the new main method:

    public static void main(String args[])
    {
        int result = 0;        
        expression = "1+2";
        System.out.println("Expression: " + expression);
        Init();        
        int firstNumber = GetNum();
        char operator = Look;
        GetChar();
        int secondNumber = GetNum();
        System.out.println("First Number: " + firstNumber);
        System.out.println("Operator: " + operator);
        System.out.println("Second Number: " + secondNumber);        
        int sum = firstNumber + secondNumber;
        System.out.println("SUM of Those Two Number: " + sum);        
    }    

Update the program with these changes and then compile & run it.

Here is the output:

Expression: 1+2
First Number: 1
Operator: +
Second Number: 2
SUM of Those Two Number: 3

We are able to read first number, then read addition operator and then second number, and our calculator presents us the sum of those two numbers.

We can apply same technique for subtraction.

I think you’ve noticed a problem here: what if we need to calculate expressions that includes arbitrary amount of numbers and operators? For example:

expression = 1+2+5+7+3

We need a better approach.

I think you can guess that we need to use a loop since we have got more things to read and calculate. But then we will face with another need: in the loop, we need to check out if the operator is addition or subtraction. Then, when we decide to add multiplication and division features to our calculator, yet again it needs to be checked out multiplication and division operators in the loop.

There is a better approach, but before getting into that, let me tell you something important that will help us a lot when calculating mathematical expressions: anything and everything in nature has a structure and pattern. Look around carefully, you will see that every single thing hides a structure and pattern inside itself.

Look petals on flower, scales on a fish, spirals on a snail’s shell, words and sentences in the languages and like many other things then you will see patterns.

Mathematical expressions are no different. They obey a pattern. Actually, mathematics is how we model patterns we see in the nature, so we can calculate and make predictions based on these models. For now, that is as far as we will go into the theory behind pattern recognition.

So, our simple expressions have a pattern as well. Lets have a look this pattern:

1+2

This simple expression tells us:

  • First a number
  • Then an operator
  • Then a number again

Lets have a look subtraction

5-2

Pattern is same:

  • First a number
  • Then an operator
  • Then a number again

There is a formula here. This formula is the same for both addition and subtraction. We can input any number to this formula; the numbers and operators may change but the formula remains the same.

It gives us an idea as to how we can model mathematical expressions using this formula.

Let’s refer to the first and second numbers as “Terms”. So, our formula becomes:

ArithmeticExpression = Term + Term

or

ArithmeticExpression = Term - Term

Lets express it in one single formula:

ArithmeticExpression = Term +- Term

Developing a formula is good, but how are we going to program it? Let me help you by giving you this tip: Let’s write a method for each element in the formula. That means we will write methods for handling ArithmeticExpression, Term and operators (+-).

Here are our new methods:

    public static void MatchAndEat(char chr)
    {
        if (Look==chr) GetChar(); else
        {
            System.out.println("Error: Unexpected character.");
            System.exit(0);
        }
    }
    
    public static int Term()
    {
        return GetNum();
    }

    public static int Add()
    {
        MatchAndEat('+');
        return Term();
    }

    public static int Subtract()
    {
        MatchAndEat('-');
        return Term();
    }

    public static int ArithmeticExpression()
    {        
        int result = Term();        
        
        switch(Look)
        {
            case '+':
                result = result + Add();
            break;
            case '-':
                result = result - Subtract();
            break;                               
        }
        
        return result;
    }

    public static void main(String args[])
    {
        expression = "1+9";
        Init();
        System.out.println(ArithmeticExpression());
    }

Lets compile & runn it.

10

We input 1+9 and it returned 10. It has calculated correctly. Let’s try it with 5-2, change the expression string to

expression = "5-2";

and compile & run it:

3

The calculation is correct again. Before diving into the explaination, I want to explain the MatchAndEat method. I choose this name because it preforms two important functions:

  • Checks if the global Look variable contains the expected character given as the method parameter
  • If it matches, it reads the next character from the expression string, otherwise it terminatesthe program

We will use this method to compare what is coming up in the string with what we expect to see in the expression. This way, we will be able to check if the expression has the correct structure in accordance with our formula.

The technique in the MatchAndEat method is important. We will use it throughout the book for verifying and enforcing the formula and pattern that we define.

Now, let’s get back to our first working calculator. The progam code is very close to formula, but there is a bit more to do. In the Expression method, we get the first Term, then we expect one of + or - operators. Depending on if the Operator is + or - will determine whether we will call Subtract or Add, then we get second Term.

Here is how the ArithmeticExpression() method works:

  • Call Term method to get first Term
  • Check out global Look variable with switch statement
  • If it is +, call Add
  • If it is -, call Subtract
  • Return result

In the Add and Subtract methods: MatchAndEat() method checks if it is expected operator. If so, it takes to second Term of the expression.

Let me ask, what if we have an expression like this:

9+2+5-3

or

9-7+3

You will remember that I proposed using loop to process more numbers than two, but then we digressed to find out a way that we can use a general formula for either addition or subtraction.

The trick is, since we have got the formula for calculating expressions, we can clone that formula in a loop, so we can calculate any amount of numbers.

Think of our formula as a black box that we use several times to calculate an expression.

Where are we going to place this loop? I think we will get first number, that means we will place Term method first and then we will place a loop to execute addition or subtraction according to upcoming operator.

I think it is good idea to put that loop before switch statement. So, after getting first Term, we can get arbitrary amount of Terms by calling Add() or Subtract() methods in the loop.

Here is our new ArithmeticExpression method:

    public static int ArithmeticExpression()
    {        
        int result = Term();                
        while ( (Look == '+') || (Look == '-') )
        {
            switch(Look)
            {
                case '+':
                    result = result + Add();
                break;
                case '-':
                    result = result - Subtract();
                break;                               
            }
        }        
        return result;
    }

And here is our new main method:

    public static void main(String args[])
    {
        expression = "9+2+5-3";
        Init();
        int result = ArithmeticExpression();
        System.out.println("Result: " + result);
    }

Lets compile & run the program.

Result: 13

It works like a charm.

We can calculate any expression with addition or subtraction or both and with any amount of numbers.

What about multiplication and division? Let’s recall the formula for addition and subtraction:

ArithmeticExpression = Term +- Term

You will notice that all arithmetic expressions are based on this formula. The Term can be a number, a variable, a function or something similar that you can calculate it first than add or subtract to or from another Term.

This feature makes us thinking that we can put a multiplication or division result into a Term. Here is an example:

ArithmeticExpression = 8 + 3

T = 8
T = 3

or

T = 8 = 2 * 4
T = 3 = 6 / 2

So, as we did for ArithmeticExpression before, we can formulate the Term as well, like this:

Term = Number * Number

or

Term = Number / Number

Let’s make it short:

Term = Number */ Number

We can name numbers in the Term formula as “Factor”. So, our formula becomes

T = Factor */ Factor

Here is forumations for both ArithmeticExpression and Term:

ArithmeticExpression = Term +- Term
Term = Factor */ Factor

ArithmeticExpression comprises Terms, and Terms comprise Factors. That means the ArithmeticExpression method will include terms and +- operators, the Term method will include factors and */ operators, and the Factor method will consist of the GetNum method call that it takes a character from expression and put it into Look.

You are probably wondering how our calculator will know that multiplication and division have a higher order of precident than addition and subtraction since we never told the calculator this. In truth, our formula makes it guaranteed that multiplication and division operators have higher precedence than addition and division since the method call chain is

ArithmeticExpression --> Term --> Factor

By using the right formula, we were able to write our code in a way that already reflects the correct order of operations.

Here are our new methods, let’s preform these changes in our program, compile, and run it.

Factor, Multiply and Divide methods:

    public static int Factor()
    {
        return GetNum();
    }

    public static int Multiply()
    {
        MatchAndEat('*');
        return Factor();
    }

    public static int Divide()
    {
        MatchAndEat('/');
        return Factor();
    }

New version of Term method that is with Factor:

    public static int Term()
    {        
        int result = Factor();                
        while ( (Look == '*') || (Look == '/') )
        {
            switch(Look)
            {
                case '*':
                    result = result * Multiply();
                break;
                case '/':
                    result = result / Divide();
                break;                               
            }
        }        
        return result;
    }

New main method:

    public static void main(String args[])
    {
        expression = "9*3-1+8*5-7";
        System.out.println("Expression: " + expression);
        Init();
        int result = ArithmeticExpression();
        System.out.println("Calculation Result: " + result);
    }

The rest is the same. Add new methods to the Calculator.java, replace the Term method with the new one, compile, and run it. Here is the result:

Expression: 9*3-1+8*5-7
Calculation Result: 59

The result is correct. We have a calculator that we can ask it to calculate any expression that includes addition, subtraction, multiplication, division. Plus, we can calculate expressions that include arbitrary amounts of numbers.

If we have a pattern/formula that we can apply it to the expressions that have the same structure, then we can program the same kind of needs without much hassle. We are going to keep this attitude throughout this book. So, our programs will be easy to write once we get the pattern.

Now, you should be confident to learn the basics of parsing and compiling jargon. As you see, we used methods/functions that call each other to calculate arithmetic expressions. Also, reading characters and parsing them will be the same, we just call a couple of methods to read and check characters and branch out according to the results.

This method is recursive descent parsing. It is easiest and most intuitive parsing method. It is also quite well know and documented. Let me point out something important: recursive descent parsing is NOT only for parsing and calculating arithmetic expressions. You can use recursive descent parsing to parse source code. We will be doing that in the next chapters, which is why I chose this tactic.

Before closing this chapter, let’s add support for paranthesis to our calculator. So, our expressions will be tidier and more handsome. To do this, lets re-think and re-define our formula.

Here is what we have at the moment:

ArithmeticExpression = Term +- Term
Term = Factor */ Factor

As I told you that our terminal is factor, we can formulate it this way:

Factor = Number

So, how about re-formulating it this way:

Factor = Number or (ArithmeticExpression)

We can do that because every expression has a numerical result. So, the Factor can either be a literal number or the result of an expression. The only thing that we need to support is surrounding the ArithmeticExpression with left and right paranthesis to separate it from a literal number.

We will change the Factor method according to this new formula, here we go:

    public static int Factor()
    {        
        int result = 0;       
        if (Look == '(')
        {
            MatchAndEat('(');
            result = ArithmeticExpression();
            MatchAndEat(')');
        } else result = GetNum();        
        return result;
    }

Please replace the Factor method with this new versoin, compile, and run your program with this expression:

(9*3-1+8)*5-7
Expression: (9*3-1+8)*5-7
Calculation Result: 163

Pretty nice that we were able to make such a huge improvement by simply surrounding an Arithmetic Expression with parenthesis. In this expression, it calculates

9*3-1+8

first, then multiplies it by 5 and subtracts 7.

Let me explain how new Factor method works:

Notice that there are two calls to the MatchAndEat method. These are for checking the left and right paranthesis that covering expression, which is ArithmeticExpression() method call.

Here another important point is, we have got methods that call each other recursively. It needs to be implemented this way since our formula has an ArithmeticExpression inside of Factor, and Factors are in the ArithmeticExpression as well.

This exemplifies another feature of recursive descent parsing that functions call each other for parsing the program source. It is important concept in recursive descent parsing.

So far so good. We have got a full working calculator/arithmetic expression parser that we programmed using recursive descent parsing method. I do recommend you study this code very well and improve it according to your taste and style. You can write a simple calculator that you get the expression string from keyboard or even from a file. Also, feel free to try and tweak the presentation of the result.

I am stressing that you need to study this chapter and the programs we have written, because they are the very essential basis of what we are going to do in the next chapters. Using the mehods that we learned in this chapter, you will be able to parse and interpret much more complex expressions, and soon, whole source code.

Here is the whole program so that you can copy-paste, compile, run, and study.

class Calculator
{    
    public static String expression = "";   
    public static int currentCharPosition = 0;    
    public static char Look;
    
    public static void Init()
    {        
        GetChar();
    }    
    public static void GetChar()
    {
        if (currentCharPosition < expression.length())
            Look = expression.charAt(currentCharPosition);
            
        currentCharPosition++;
    }
    public static int GetNum()
    {
        int number = new Integer(Look+"").intValue();
        GetChar();
        
        return number; 
    }    
    public static void MatchAndEat(char chr)
    {
        if (Look==chr) GetChar(); else
        {
            System.out.println("Error: Unexpected character.");
            System.exit(0);
        }
    }    
    public static int Factor()
    {        
        int result = 0;       
        if (Look == '(')
        {
            MatchAndEat('(');
            result = ArithmeticExpression();
            MatchAndEat(')');
        } else result = GetNum();        
        return result;
    }
    public static int Multiply()
    {
        MatchAndEat('*');
        return Factor();
    }
    public static int Divide()
    {
        MatchAndEat('/');
        return Factor();
    }
    public static int Term()
    {        
        int result = Factor();                
        while ( (Look == '*') || (Look == '/') )
        {
            switch(Look)
            {
                case '*':
                    result = result * Multiply();
                break;
                case '/':
                    result = result / Divide();
                break;                               
            }
        }        
        return result;
    }
    public static int Add()
    {
        MatchAndEat('+');
        return Term();
    }
    public static int Subtract()
    {
        MatchAndEat('-');
        return Term();
    }
    public static int ArithmeticExpression()
    {        
        int result = Term();
        while ( (Look == '+') || (Look == '-') )
        {
            switch(Look)
            {
                case '+':
                    result = result + Add();
                break;
                case '-':
                    result = result - Subtract();
                break;                               
            }
        }        
        return result;
    }
    public static void main(String args[])
    {
        expression = "(9*3-1+8)*5-7";
        System.out.println("Expression: " + expression);
        Init();
        int result = ArithmeticExpression();
        System.out.println("Calculation Result: " + result);
    }    
}

PacMan Eating Characters: The Tokenizer

In the last section, we used the methods GetChar and GetNum. The second one, GetNum, uses the GetChar method. There are also two variables, Look and currentCharPosition, which accompany those methods.

In this chapter, we are going to talk about currentCharPosition and Look variables. We are going to see their importance in compiler and interpreter jargon. Knowing the general jargon and definitions in any area makes your understanding of a topic much more thorough, whether from this book or other sources. Those sources are other compiler books, articles, compiler and interpreter source code, etc. that you will read when you advance your study.

The Look variable is named as such as it functions as a look ahead character in parsing jargon. It is used to store the current character we are evaluating in the expression. The look variable gets its new value when GetChar and its subsidiary methods gets called. For example, GetNum retrieves the char from the Look variable, converts it a number to return, and then calls GetChar to advance the look ahead character.

When method calls advance the look ahead character, another important thing happens: the currentCharPosition variable needs to be increased at every call of GetChar. Otherwise we would be stuck reading the same character forever. currentCharPosition is a number that holds where we are: the number it holds is the character index in the expression string. Here is an example:

expression = "5+6";

Here, when we call Init method, you know there is a call to GetChar in the Init method, then the Look variable becomes ‘5’ and currentCharPosition becomes 1. We then call GetNum, the Look variable becomes ‘+’ and currentCharPosition becomes 2. It starts from 0 and goes untill 2 since 2 is the last index in the string.

Here, we take advantage of Java strings that we can read them as character arrays. So, this array starts from zero and goes to 2. So, currentCharPosition starts with 0 and goes until 2.

The logic and the method calls here is very important as it is fundamental structure of all tokenizers. A Tokenizer is a program that reads the source, splits it into collections of characters, called tokens and give those tokens to us as we decide to handle them.

You are going to recall that wasn’t a tokenizer in the last chapter. There wasn’t a separate one, but the program itself was a tokenizer that read the expression. And then, our program took some actions based on that character, if it is a number, then fetch it; if it is an operator, then perform calculation according to that operator.

Tokenizers are not too intelligent themselves, they just get tokens and put them into a data structure, like an array or linked list, and hand the collection over to parser.

The parser is the part of the program that takes tokens from a data structure, checks them, and takes some action according to the token’s value. The actions that parser takes can be directly interpreted, producing machine code or building another data structure to give to another component of the program. The parser take those actions based on the rules of the programming language it was built for.

The advantage in separating the tokenizer and parser is important to recognize. If we design our interpreter this way, like separate black boxes, our program will be more consistent. This way, if tokenizer breaks, then we only need to check the tokenizer side, since its duty is only giving us the tokens that parser needs. Or similarly, if the parser breaks, then we only need to fix the parser. They can be tested separately, which makes your program more modular and saner to manage.

There are many other advantages that you will realize as we advance, but I won’t spoil them here.

Here, we can depict our program according to this new jargon:

(Program Source Code) ---> |TOKENIZER| ---> (Tokens) ---> |PARSER| ---> |INTERPRETER|

So, what good is knowing this? Let me answer that question by asking another important question: Our program calculates expressions that consists of numbers that have only one digit, right? So, what if we want to use multi-digit numbers? Like this one:

853+92*10-20/2+771

We need a technique that handles the numbers that are 853, 92, 10, 20, 2 and 771.

Lets meet use the token concept! These numbers are the tokens that our parser needs to handle as whole parts, not by single numbers.

The solution is reading the characters one by one, merging them in a string, and putting it into a token until we see a character that signifies we need to stop. For example, if we read 853+92, then we need to read 8,5, and 3 before we see a + sign and put 853 into a token, then we need to re-start reading to read 9 and 2 and put 92 into another token. Digits are stored in the characters one by one. All we do is getting them one by one and merging them in a string.

Lets get start to implement this plan.

We need to change our GetNum method. Here is the new one:

    public static int GetNum()
    {
        int number = 0;
        String tempStr = "";        
        if (!Character.isDigit(Look))
        {
            System.out.println("Error: Numbers expected."); System.exit(0);
        }        
        while (Character.isDigit(Look))
        {         
            tempStr = tempStr + Look;
            GetChar();
        }               
        number = new Integer(tempStr).intValue();
        return number; 
    }

Now, change the old GetNum method to this new version. Then use this expression string:

expression = "(9*3-1+8)*5-70";

compile & run the program:

Expression: (9*3-1+8)*5-70

What happens here?

It just writes and waits. The program does not end. There must be an infinite loop somewhere. I know what happens here, let me give you a tip to find out; change the expression string by adding at the end of it a space like this:

expression = "(9*3-1+8)*5-70 ";

Then compile-run the program.

Expression: (9*3-1+8)*5-70
Calculation Result: 100

It works. Interesting. All we have done is just added a space character to the end of the expression string. That means, there must be a character at the end of the expression string. Why? We only changed GetNum method, so the problem must be there.

Change the expression string to previous one to reproduce the problem again:

expression = "(9*3-1+8)*5-70";

It is the one without space at the end. The program gets into an infinite loop with this expression.

Lets go to GetNum method and have a look the while loop there:

while (Character.isDigit(Look))
{
    tempStr = tempStr + Look;
    GetChar();
}

Lets put these two lines into the while loop to write the value in the Look variable in every single step of while loop. The sleep call is only for slowing down the output so we can read it more clearly. This is a poor man’s debugger that you can use for small programs. If you have a debugger you’re used to using, that is fine too.

        while (Character.isDigit(Look))
        {
            System.out.println(Look);
            try {Thread.sleep(300);} catch (Exception e) {}
            tempStr = tempStr + Look;
            GetChar();
        }       

And run the program. Here we go:

Expression:(9*3-1+8)*5-70
9
3
1
8
5
7
0
0
0
0
0
0
0
...

There is a zero at the end of expression string and the program writes it infinite manner. It does not stop unless we break it with CTRL-C. There is no result line that we normally see when program ends.

The zero is the last digit that it reads in the expression string, and it does not advance anymore. That digit remains in the Look variable.

The problem is, while loop become infinite since it needs to see a character different than digit in the Look variable.

This while loop ends if Look gets something different than digit type of character like space, new line or anything else. Even a letter is fine. Try this one for example:

expression = "(9*3-1+8)*5-70A";
Expression:(9*3-1+8)*5-70A
9
3
1
8
5
7
0
Calculation Result: 100

It works. The last line written is Calculation Result: 100. It is the correct result.

So, the reason for the infinite loop earlier was because there was no terminating character to tell us we were at the end of the expression. Since the while condition must see a different character, we’ll use a space character for our examples.

We know the reason and the fix now. I prefer doing this in code rather than in every expression:

expression += " ";

Here is our new main method:

    public static void main(String args[])
    {
        expression = "(9*3-1+8)*5-70";
        expression += " ";
        System.out.println("Expression:" + expression);
        Init();
        int result = ArithmeticExpression();
        System.out.println("Calculation Result: " + result);
    }    

So, it adds a space to end of expression string every time and it is fixed.

I pointed out this small thing because it is very important. While writing a tokenizer, parser, interpreter or compiler, even a single character can break everything. You need to think through this sort of case.

Since we fixed the issue, we can remove our debugger from the GetNum method.

We are able to handle multi-digit numbers. So, lets try this one:

expression = "853+92*10-20/2+771";
2534

It gives us the correct result. We successfully added the multi-digit number handling feature to our calculator without much effort and without changing the core methods like Factor, Term and Expression. Did you notice that we only change the tokenizer code, not the parser? The parser methods are Factor, Term and Expression. We have not touched them.

So far so good. We are progressing well.

Now, lets write a pretty printer for our calculator, so our output is more detailed, and easier to debug. Like this:

Number..: 853
Operator: ADD
Number..: 92
Operator: MULTIPLY
Number..: 10
Operator: SUBTRACT
Number..: 20
Operator: DIVISION
Number..: 2
Operator: ADD
Number..: 771
You have got 6 different number and 5 operators.
Calculation result is 2534.

The first option is, in the program code, I can print things as I come across them. But it really ends up being a messy solution that makes the program crowded and difficult to maintain. For only having a pretty print or something similar, I wouldn’t want to complicate the program. As our program grows, it becomes more and more difficult to read, understand and maintain if we keep adding that way.

I have a second option. after each Getnum call, I can store the number into a data structure. I would also need to do same thing for all the operators that I come across. Then, after having my array list filled, I can print it in any format I choose. I don’t think that it is an effective solution since it’s similar to the first option, and shares it’s pitfalls.

There is another option, which is my favorite; like we did in GetNum, we can get operators as though they are separate entities. Say for example, we name them operator tokens and number tokens.

The byproduct of this approach is that once the two types of tokens are separated out, we can perform computation on them in a simpler way.

Our intention was having a more structured printing facility, but I claim that we can use this new structure in our parser method (Factor, Term, and Expression).

Let me explain it using an analogy. Remember, we were calculating single digit numbers including expressions by checking out Look variable and by advancing currentCharPosition as we read the characters in the expression string. In that expression string, our single smallest element was character.

Then, we changed it to make our single smallest elements to numbers.

Now, our smallest elements are tokens instead of numbers. Tokens are java strings that includes whole multi-digit numbers or operators.

Our calculator will get those tokens and pass them to parsing methods– Factor, Term, and Expression. The calculation part of the program does not need know that the number is single digit or multi-digit. All it knows is its type. So, according to its type it will add, subtract, multiply and divide according to the upcoming operator.

That explaination should be clear enough to introduce you to the tokenizer.

We need a new class: Token. Its objects will be for storing a token’s value and type. Here it is:

class Token
{
    public String text;
    public String type;

    public Token(String text, String type)
    {
        this.text = text;
        this.type = type;
    }
    public String toString()
    {
        return "Text: " + text + " Type: " + type; 
    }
}

We need a new global variable for storing token index. It is like currentCharPosition that we keep track of the current characters index in the expression. Now, currentTokenPosition will keep track of the index of current token in the java array list. we are going to advance currentTokenPosition as we get a new token from java array list.

Here it is currentTokenPosition:

    public static int currentTokenPosition = 0;

We will have three methods that we split the expression into tokens.

  • IsOp
  • FindOpType
  • Tokenize
  • Isop is for to check out if a character is an operator like + - / *
  • FindOpType is to find out what type of operator given character is
  • Tokenize is to split up the expression into tokens and to store them into an array list

Here is complete program with those methods:

import java.util.List;
import java.util.ArrayList;

class Tokenizer
{    
    public String expression = "";   
    public int currentCharPosition = 0;    
    public char Look;

    public void Init()
    {        
        GetChar();
    }    

    public void GetChar()
    {
        if (currentCharPosition < expression.length())
            Look = expression.charAt(currentCharPosition);
            
        currentCharPosition++;
    }    

    public boolean IsOp(char chr)
    {    
        return (chr == '+') || (chr == '-') || 
                (chr == '*') || (chr == '/') || 
                (chr == '(') || (chr == ')');
    }

    public String FindOpType(char chr)
    {
        String type = "UNKNOWN";
        switch(chr)
        {
            case '+':
                type = "ADD";
            break;            
            case '-':
                type = "SUBTRACT";
            break;            
            case '*':
                type = "MULTIPLY";
            break;            
            case '/':
                type = "DIVIDE";
            break;            
            case '(':
                type = "LEFT_PAREN";
            break;            
            case ')':
                type = "RIGHT_PAREN";
            break;            
        }
        return type;
    }

    public List<Token> Tokenize(String source)
    {
        List<Token> tokens = new ArrayList<Token>();
        String token = "";        
        String state = "DEFAULT";
        
        for (int index = 0; index < source.length(); index++)
        {
            char chr = source.charAt(index);
            switch(state)
            {
                case "DEFAULT":                
                    String opType = FindOpType(chr);                    
                    if (IsOp(chr))
                    {                       
                        tokens.add(new Token(Character.toString(chr), opType));
                    }                    
                    else if (Character.isDigit(chr))
                    {
                        token += chr;
                        state = "NUMBER";
                    }               
                break;
                
                case "NUMBER":
                    if (Character.isDigit(chr))
                    {
                        token += chr;
                    }
                    else
                    {
                        tokens.add(new Token(token, "NUMBER"));
                        token = "";
                        state = "DEFAULT";
                        index--;
                    }                
                break;
            }
        }
        return tokens;
    }
    
    public static void main(String args[])
    {
        String expression = "219+341+19";
        expression += " ";
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.Tokenize(expression);
        for (Token token: tokens)
        {
            System.out.println("Type:" + token.type + " Text: " + token.text);
        }
    }
    
}

I have written its main method to test our tokenizer. Lets compile & run this program:

Type: NUMBER Text: 219
Type: ADD Text: +
Type: NUMBER Text: 341
Type: ADD Text: +
Type: NUMBER Text: 19

Nice. We have got a pretty printer. Plus, we have the expression elements in an array list. We can use this array list and the tokens inside it for calculating the expression.

Before getting to the calculation part, I want give you more proper version of the pretty print method. Lets add below PrettyPrint method to our program in Tokenizer.java. Also replace the main method with below one.

    public void PrettyPrint(List<Token> tokens)
    {
        int numberCount = 0;
        int opCount = 0;
        
        for (Token token: tokens)
        {
            if (token.type.equals("NUMBER"))
            {
                System.out.println("Number....: " + token.text);
                numberCount++;
            }
            else
            {
                System.out.println("Operator..: " + token.type);
                opCount++;
            }                
        }
        System.out.println("You have got "+ numberCount + 
                           " different number and " + opCount
                           +" operators.");
    }
    
    public static void main(String args[])
    {
        String expression = "219+341+19";
        expression += " ";        
        
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.Tokenize(expression);
        
        System.out.println("--------------");
        tokenizer.PrettyPrint(tokens);
    }

Lets compile & run. Here is the output:

--------------
Number....: 219
Operator..: ADD
Number....: 341
Operator..: ADD
Number....: 19
You have got 3 different number and 2 operators.

It really is a pretty printout of our expression.

As I said before, once we have got a tokenizer, we can do many cool things with it. Yet another example: I have 40-50 complex arithmetic expressions and I am preparing a report that summarizes them, showing numbers greater than 100 and only multiplication operators. The report would be presented in HTML format and I will present numbers and operators using different colors. I assume that you have a solution in your mind as you read. Tokenizer allow you to use a lot of different tools to customize how your program will function. Just widen your horizon.

Before calculating expressions with the tokenizer we have written, I want to talk about the tokenizer. In some books, you will see descriptions like a tokenizer is a state machine that parses regular expressions. A State Machine exists only in explicit finite positions, such as “I am reading a number”, “I am reading an operator”, “I am transitioning from an operator to a number”, or finally, “I am transitioning from a number to an operator.” Regular Expressions we will go into more detail on later, but for now suffice it to say that they are a way for describing all expressions a language contains. We developed one when defining Terms, Factors, and Expressions.

There is a string variable in the Tokenize method of Tokenizer class above. Its name is state:

String state = "DEFAULT";

We use this variable to keep state of the tokenizer. We have got only two states: default and number. Our state machine works this way:

Its default state is DEFAULT that it takes the operators like + - * / there in the default state. If it come across any number, it gets into the NUMBER state and it stays there until there are no numbers coming out. How does it understand there is no more numbers? It checks if it is a number in every step of the loop, if there are no more numbers, it puts the all numbers it collected into the new Token object and gets the tokenizer to the DEFAULT state to make it continue getting the next upcoming tokens from the source string. This process continues until no more characters remain in the source string.

As I said, we have two states at the moment: default and number. I think you can guess that we will have more states for strings, language keywords, etc. This way, we will recognize and tokenize source string elements. Then, we will give it to our parser to produce the data structure that the interpreter needs.

State machines are very useful to tokenize strings that follow regular structure. Regular means: There is either one or more characters of same type, like one character, or couple of characters, that make the expression, or one or more digit to comprise a number. Here, regularity is non nested structure. So there will be no recursive nested code blocks or a natural language sentences.

For that reason, this type of structures is named as regular expressions. Almost every programming language has one or more regular expression parsing facilities that you use in your daily programming activities to parse, say for example, a url, input text, etc.

For parsing the structures like recursive nested code blocks or a natural language sentences, the state machine is not enough, it needs different parsing techniques, like revursive descent parsing.

We will learn how to parse programming language source using recursive descent parsing in later chapters.

Alright, we have printed our expression elements (tokens) pretty. Now, it is time to calculate our expression. All we need to do is change our calculation methods so that they will use tokens instead of characters. Also, we are not going to use GetNum method now that we have tokens in the array list as separate tokens. For getting that token’s value, all we need to do is convert the token text to an integer.

The logical change in the calculation method is that we will check token type instead of checking the Look variable value. That means character checking will be token checking.

Here is the new version of the ArithmeticExpression() method:

    public static int ArithmeticExpression()
    {        
        int result = Term();               
        while ( CurrentToken().type.equals("ADD") || 
                CurrentToken().type.equals("SUBTRACT") )        
        {
            switch(CurrentToken().type)
            {
                case "ADD":                
                    result = result + Add();
                break;
                case "SUBTRACT":
                    result = result - Subtract();
                break;                               
            }
        }        
        return result;
    }

Term is similar. I am going to give you full source code soon, I want to explain some important points first.

The CurrentToken() method we use here is to get the token that we are on at the moment. It is similar to Look variable’s duty that it holds the current token. Now, instead of using Look variable and checking its value, we use GetCurrentToken() method.

Here it is source code:

    public Token CurrentToken()
    {
        return GetToken(0);
    }

We use the GetToken() method the get one token at a time from the array list here we store the tokens. For example if you give it 0 as parameter, it gets current token according to currentTokenPosition value.

The other methods are:

  • EatToken(int offset)
  • MatchAndEat(String type)

I want to explain them one by one.

EatToken(int offset): It just eats the token(s) given in the offset. Eating means, it increases the currentTokenPosition value, so index of the array list shifts that it means it skips the eaten tokens in the array list. It is similar to technique that we applied using the GetChar and MatchAndEat methods as well as the Look variable. GetChar was getting the next character from expression string, put it into Look variable, then increase the currentCharPosition value. MatchAndEat was calling GetChar to forward the stream of characters.

MatchAndEat(String type): It gets the current token and checks it out if its type is given type. If so, it eats one token since it is the token we expect. By eating it, it gives the parser the opportunity to continue. If its type is not the expected one, it just stops the program since there is something in the expression breaking the rules.

MatchAndEat method is very important, and one of the essentials in the book. We will use it to check out tokens and make a decision according to MatchAndEat’s result. The decision I mention is branching. The greatest example of this behavior would be the use of an if/else block in Java.

The logic is the same, the only difference is we use tokens instead of characters. The rest is same in terms of increasing the index, checking out the token we have and deciding according to its contents.

You will notice that our calculation methods have very few changes. Here is the full source code, please save, compile, and execute it.

Token.java:

class Token
{
    public String text;
    public String type;

    public Token(String text, String type)
    {
        this.text = text;
        this.type = type;
    }
    public String toString()
    {
        return "Text: " + text + " Type: " + type; 
    }
}

Tokenizer.java:

import java.util.List;
import java.util.ArrayList;

class Tokenizer
{    
    public String expression = "";   
    public int currentCharPosition = 0;    
    public char Look;

    public void Init()
    {        
        GetChar();
    }    

    public void GetChar()
    {
        if (currentCharPosition < expression.length())
            Look = expression.charAt(currentCharPosition);
            
        currentCharPosition++;
    }    

    public boolean IsOp(char chr)
    {    
        return (chr == '+') || (chr == '-') || 
                (chr == '*') || (chr == '/') || 
                (chr == '(') || (chr == ')');
    }

    public String FindOpType(char chr)
    {
        String type = "UNKNOWN";
        switch(chr)
        {
            case '+':
                type = "ADD";
            break;            
            case '-':
                type = "SUBTRACT";
            break;            
            case '*':
                type = "MULTIPLY";
            break;            
            case '/':
                type = "DIVIDE";
            break;            
            case '(':
                type = "LEFT_PAREN";
            break;            
            case ')':
                type = "RIGHT_PAREN";
            break;            
        }
        return type;
    }

    public List<Token> Tokenize(String source)
    {
        List<Token> tokens = new ArrayList<Token>();
        String token = "";        
        String state = "DEFAULT";
        
        for (int index = 0; index < source.length(); index++)
        {
            char chr = source.charAt(index);
            switch(state)
            {
                case "DEFAULT":                
                    String opType = FindOpType(chr);                    
                    if (IsOp(chr))
                    {                       
                        tokens.add(new Token(Character.toString(chr), opType));
                    }                    
                    else if (Character.isDigit(chr))
                    {
                        token += chr;
                        state = "NUMBER";
                    }               
                break;                
                case "NUMBER":
                    if (Character.isDigit(chr))
                    {
                        token += chr;
                    }
                    else
                    {
                        tokens.add(new Token(token, "NUMBER"));
                        token = "";
                        state = "DEFAULT";
                        index--;
                    }                
                break;
            }
        }
        return tokens;
    }

    public void PrettyPrint(List<Token> tokens)
    {
        int numberCount = 0;
        int opCount = 0;
        
        for (Token token: tokens)
        {
            if (token.type.equals("NUMBER"))
            {
                System.out.println("Number....: " + token.text);
                numberCount++;
            }
            else
            {
                System.out.println("Operator..: " + token.type);
                opCount++;
            }                
        }
        System.out.println("You have got "+ numberCount + 
                           " different number and " + opCount
                           +" operators.");
    }
    
    public static void main(String args[])
    {
        String expression = "219+341+19";
        expression += " ";        
        
        Tokenizer tokenizer = new Tokenizer();
        List<Token> tokens = tokenizer.Tokenize(expression);
        
        System.out.println("--------------");
        tokenizer.PrettyPrint(tokens);
    }
    
}

Calculator.java:

import java.util.List;
import java.util.ArrayList;

class Calculator
{        
    public int currentTokenPosition = 0;            
    public List<Token> tokens;           
    
//~~~~Token Manipulation Methods Start~~~~
    public Token GetToken(int offset)
    {
        if (currentTokenPosition + offset >= tokens.size())
        {
            return new Token("", "NO_TOKEN");
        }
        return tokens.get(currentTokenPosition + offset);
    }        
    public Token CurrentToken()
    {
        return GetToken(0);
    }    
    // Just eats the token(s) given in the offset
    public void EatToken(int offset)
    {           
        currentTokenPosition = currentTokenPosition + offset;
    }
    // Eats the token given type and returns eaten token
    public Token MatchAndEat(String type)
    {       
        Token token = CurrentToken();
                
        if (!CurrentToken().type.equals(type))
        {
            System.out.println("Saw " + token.type +
                                " but " + type +
                                " expected");
            System.exit(0);
        }
        EatToken(1);
        return token;
    }        
//~~~~Token Manipulation Methods End~~~~    
    public int Multiply()
    {        
        MatchAndEat("MULTIPLY");
        return Factor();
    }
    public int Divide()
    {
        MatchAndEat("DIVIDE");
        return Factor();
    }   
    public int Add()
    {                
        MatchAndEat("ADD");
        return Term();
    }
    public int Subtract()
    {
        MatchAndEat("SUBTRACT");
        return Term();
    }
    public int Factor()
    {
        int result = 0;        
        if (CurrentToken().type.equals("LEFT_PAREN"))
        {
            MatchAndEat("LEFT_PAREN");
            result = ArithmeticExpression();
            MatchAndEat("RIGHT_PAREN");
        }
        else if (CurrentToken().type.equals("NUMBER"))
        {            
            result = new Integer(CurrentToken().text).intValue();
            MatchAndEat("NUMBER");
        }
        return result;
    }
    public int Term()
    {        
        int result = Factor();
                                       
        while ( CurrentToken().type.equals("MULTIPLY") ||
                CurrentToken().type.equals("DIVIDE") )
        {
            switch(CurrentToken().type)
            {
                case "MULTIPLY":
                    result = result * Multiply();
                break;
                case "DIVIDE":
                    result = result / Divide();
                break;                               
            }
        }        
        return result;
    }    
    public int ArithmeticExpression()
    {        
        int result = Term();                
        while ( CurrentToken().type.equals("ADD") ||
                CurrentToken().type.equals("SUBTRACT") )       
        {
            switch(CurrentToken().type)
            {
                case "ADD":                
                    result = result + Add();
                break;
                case "SUBTRACT":
                    result = result - Subtract();
                break;                               
            }
        }        
        return result;
    }
    public void PrettyPrint(List<Token> tokens)
    {
        int numberCount = 0;
        int opCount = 0;        
        for (Token token: tokens)
        {
            if (token.type.equals("NUMBER"))
            {
                System.out.println("Number....: " + token.text);
                numberCount++;
            }
            else
            {
                System.out.println("Operator..: " + token.type);
                opCount++;
            }                
        }
        System.out.println("You have got " + numberCount +
                           " different number and " + opCount + " operators.");
    }
    public static void main(String args[])
    {                
        String expression = "((853+92*5)*10-20/2+771)";
        expression += " ";
        Calculator calc = new Calculator();
        Tokenizer tokenizer = new Tokenizer();
        
        System.out.println("Expression: " + expression);
        System.out.println("--------------------------");        
        calc.tokens = tokenizer.Tokenize(expression);
        calc.PrettyPrint(calc.tokens);        
        System.out.println("--------------------------");
        System.out.println("Expression Result: " + calc.ArithmeticExpression());
    }    
}

Here is the results when I run it:

Expression: ((853+92*5)*10-20/2+771)
--------------------------
Operator..: LEFT_PAREN
Operator..: LEFT_PAREN
Number....: 853
Operator..: ADD
Number....: 92
Operator..: MULTIPLY
Number....: 5
Operator..: RIGHT_PAREN
Operator..: MULTIPLY
Number....: 10
Operator..: SUBTRACT
Number....: 20
Operator..: DIVIDE
Number....: 2
Operator..: ADD
Number....: 771
Operator..: RIGHT_PAREN
You have got 7 different number and 10 operators.
--------------------------
Expression Result: 13891

Such nice output! We have got the numbers and operators printed pretty and we have expression result.

Have you noticed that we have got two token types other than numbers and operators? LEFT_PAREN and RIGHT_PAREN.

Our recognition method IsOp can identify them. FindOpType performs a lookup as well. These two functions are referred to as recognizer methods.

It is another nice feature of our tokenizer and calculator. We are able to recognize and pretty print parenthesis as well as calculate the expressions that are within parenthesis!

Before closing this chapter I want to talk about the MatchAndEat method again. You know, this is the method that we use to match and then eat tokens. Here its source code:

    public Token MatchAndEat(String type)
    {       
        Token token = CurrentToken();
                
        if (!CurrentToken().type.equals(type))
        {
            System.out.println("Saw " + token.type +
                                " but " + type +
                                " expected");
            System.exit(0);
        }
        EatToken(1);
        return token;
    }

In some books, similar sources, or even in some source code of interpreters you will see, that our equivalent is consume, match, or something similar. I deliberately named it “Match and Eat” because it matches and then eats/consumes the token. Using only Match can be confusing, because its duty is not only matching, but eating the next token as well. You will see only Match in some other books or interpreters, but when you have a look, it does both. That being said, Match is very popular in compiler theory, and you will see it a lot while studying, so I wanted to make that connection here.

Alright. We have a tokenizer as of this chapter, plus, we are able to calculate arithmetic expressions by using this tokenizer. The tokenizer and calculator methods are in same program for now. We are going to separate them in the next chapters.

Also, in the next chapter, besides having separate parser for our calculator, we are going to add a boolean expression calculation feature. You may wonder what the good of calculating boolean expressions is, so let me explain briefly. Being able to calculating boolean expressions will allow us to use them in language constructs like if-else cases, while loops and similar constructs that need any type of comparison.

Our calculator is becoming a programming language interpreter. Keep on reading!

Better Token Handling

Before getting into details of parsing arithmetic and boolean expressions together, I want to change something in the token handling.

As you all rememeber, we were using the String type for the token type, in the Token class.

Besides, we were taking advantage of the switch-case structure that it supports Strings.

Even so, I am sure you noticed this approach and wondered “Couldn’t that be more effective if we use a number based typing for our token type?”

Yes, it may be more effective to use numbers. We can use integer or something similar, but, the problem with using these types for this sort of need is that they are less readable and are disorganised. You would need to name every type in advance and assign that particular type a number value, a considerable overhead for a little speed.

Luckily, we have another option which is far more elegant for our use case: the Enum type.

Java supports the Enum (Enumeration) type to give us the best of both worlds. Now, we will go back and change the type of TokenType and update all the related sections according to this change. We are doing this now, so that we are not worried about having to do a massive change in the next chapters.

First thing we will change is Token class, here its new version:

class Token
{
    public final String text;
    public final TokenType type;

    public Token(String text, TokenType type)
    {
        this.text = text;
        this.type = type;
    }
    public String toString() 
    {
        return "Text: " + text + " Type: " + type; 
    }
}

Save it to Token.java.

And here is the TokenType enum that we will use for token types:

public enum TokenType
{
    NUMBER, NEWLINE, OPERATOR, EOF, UNKNOWN,
    ADD, SUBTRACT, MULTIPLY, DIVIDE, LEFT_PAREN, RIGHT_PAREN
}

We will keep the Enums in a separate java file. Save this into TokenType.java.

Remember that we used a variable to keep the tokenizer state, and it too was String. We will change it to an Enum as well. Here it is:

enum TokenizeState 
{
    DEFAULT, OPERATOR, NUMBER
}

Save this file as TokenizeState.java. Afterwards, compile all three files. Now, we need to change all places in the Tokenizer and the Calculator that we used String comparasion. You know, we were using equals for comparing tokens since the token was a String. But now, we will use == since the token type is Enum.

Since we have updated Tokenizer.java and Calculator.java, I will reproduce the complete source files below.

Tokenizer.java:

import java.util.List;
import java.util.ArrayList;

class Tokenizer
{    
    
    public boolean IsOp(char chr)
    {   
        boolean aritOp = chr == '+' || chr == '-' || 
                          chr == '*' || chr == '/';
        return aritOp;
    }

    public TokenType FindOpType(char firstOperator)
    {
        TokenType type = TokenType.UNKNOWN;
        switch(firstOperator)
        {
            case '+':
                type = TokenType.ADD;
            break;
            case '-':
                type = TokenType.SUBTRACT;
            break;
            case '*':
                type = TokenType.MULTIPLY;
            break;
            case '/':
                type = TokenType.DIVIDE;
            break;
        }
        return type;
    }

    public boolean IsParen(char chr)
    {   
        boolean prntOp = chr == '(' || chr == ')';        
        return prntOp;
    }

    public TokenType FindParenType(char chr)
    {
        TokenType type = TokenType.UNKNOWN;
        switch(chr)
        {
            case '(':
                type = TokenType.LEFT_PAREN;
            break;
            case ')':
                type = TokenType.RIGHT_PAREN;
            break;        
        }
        return type;
    }
    
    public List<Token> Tokenize(String source)
    {
        List<Token> tokens = new ArrayList<Token>();
        String token = "";        
        TokenizeState state = TokenizeState.DEFAULT;
        
        for (int i = 0; i < source.length(); i++)
        {
            char chr = source.charAt(i);            
            switch(state)
            {
                case DEFAULT:                
                    TokenType opType = FindOpType(chr);                    
                    if (IsOp(chr))
                    {                       
                        tokens.add( new Token(Character.toString(chr), opType) );
                    }
                    else if (IsParen(chr))
                    {
                        TokenType parenType = FindParenType(chr);
                        tokens.add(new Token(Character.toString(chr), parenType));
                    }                    
                    else if (Character.isDigit(chr))
                    {
                        token += chr;
                        state = TokenizeState.NUMBER;
                    }               
                break;                
                case NUMBER:
                    if (Character.isDigit(chr))
                    {
                        token += chr;
                    }
                    else
                    {
                        tokens.add(new Token(token, TokenType.NUMBER));
                        token = "";
                        state = TokenizeState.DEFAULT;
                        i--;
                    }                
                break;
            }
        }
        return tokens;
    }
}

Calculator.java:

import java.util.List;
import java.util.ArrayList;

class Calculator
{        
    public int currentTokenPosition = 0;
    public List<Token> tokens;            

//~~~~Token Manipulation Methods Start~~~~
    public Token GetToken(int offset)
    {
        if (currentTokenPosition + offset >= tokens.size())
        {
            return new Token("", TokenType.EOF);
        }
        return tokens.get(currentTokenPosition + offset);
    }        

    public Token CurrentToken()
    {
        return GetToken(0);
    }

    public Token NextToken()
    {
        return GetToken(1);
    }
    
    public void EatToken(int offset)
    {           
        currentTokenPosition = currentTokenPosition + offset;
    }

    public Token MatchAndEat(TokenType type)
    {       
        Token token = CurrentToken();
                
        if (CurrentToken().type != type)
            System.out.println("Saw " + token.type + " but " + type + " expected");
        
        EatToken(1);
        return token;
    }

//~~~~Token Manipulation Methods End~~~~

    public int Multiply()
    {        
        MatchAndEat(TokenType.MULTIPLY);
        return Factor();
    }
    
    public int Divide()
    {
        MatchAndEat(TokenType.DIVIDE);
        return Factor();
    }   
    
    public int Add()
    {                
        MatchAndEat(TokenType.ADD);
        return Term();
    }
    
    public int Subtract()
    {
        MatchAndEat(TokenType.SUBTRACT);
        return Term();
    }
    
    public int Factor()
    {
        int result = 0;        
        if (CurrentToken().type == TokenType.LEFT_PAREN)
        {
            MatchAndEat(TokenType.LEFT_PAREN);
            result = ArithmeticExpression();
            MatchAndEat(TokenType.RIGHT_PAREN);
        }
        else if (CurrentToken().type == TokenType.NUMBER)
        {            
            result = new Integer(CurrentToken().text).intValue();
            MatchAndEat(TokenType.NUMBER);
        }
        return result;
    }
    
    public int Term()
    {        
        int result = Factor();
        while (CurrentToken().type == TokenType.MULTIPLY || 
                CurrentToken().type == TokenType.DIVIDE)
        {
            switch(CurrentToken().type)
            {
                case MULTIPLY:
                    result = result * Multiply();
                    
                break;
                case DIVIDE:                    
                    result = result / Divide();                               
                break;                               
            }
        }        
        return result;
    }
    
    public int ArithmeticExpression()
    {        
        int result = Term();                
        while (CurrentToken().type == TokenType.ADD || 
                CurrentToken().type == TokenType.SUBTRACT)
        {
            switch(CurrentToken().type)
            {
                case ADD:                
                    result = result + Add();                    
                break;
                case SUBTRACT:
                    result = result - Subtract();
                break;                               
            }
        }                
        return result;
    }
    
    public void PrettyPrint(List<Token> tokens)
    {
        int numberCount = 0;
        int opCount = 0;        
        for (Token token: tokens)
        {
            if (token.type == TokenType.NUMBER)
            {
                System.out.println("Number....: " + token.text);
                numberCount++;
            }
            else
            {
                System.out.println("Operator..: " + token.type);
                opCount++;
            }                
        }
        System.out.println("You have got "+
                            numberCount +
                            " different number and "
                            + opCount +
                            " operators.");
    }
               
    public static void main(String args[])
    {        
        String expression = "(600+3+2)*5";
        expression += " ";
        
        Calculator calc = new Calculator();
        Tokenizer tokenizer = new Tokenizer();
        
        System.out.println("Expression: " + expression);
        System.out.println("--------------------------");
        calc.tokens = tokenizer.Tokenize(expression);
        calc.PrettyPrint(calc.tokens);
        System.out.println("--------------------------");
        
        int result = calc.ArithmeticExpression();
        System.out.println( result );
    }
    
}

Save and compile them. And run Calculator.java, you should be seeing this result same as me:

Expression: (600+3+2)*5
--------------------------
Operator..: LEFT_PAREN
Number....: 600
Operator..: ADD
Number....: 3
Operator..: ADD
Number....: 2
Operator..: RIGHT_PAREN
Operator..: MULTIPLY
Number....: 5
You have got 4 different number and 5 operators.
--------------------------
3025

Let me explain why I introduced you to a string based tokenizer first before giving you this new version. I wanted to ensure that the transition from characters to tokens was as smooth as possible, and the most common way to deal with collections of characters in Java is with Strings.

I did not jump directly to the Enums since they would seem like hiding complexity in the details. In the end, what we do is compare values and we used characters and Strings in the beginning.

And now, instead of comparing Strings, we realize that we can use and so compare Enum types.

We have an Enum based tokenizer as of this section. We can continue our work on parsing arithmetic and boolean expressions.

Chapter 4 - Parsing Programming Language Constructs

Statements and Code Blocks

In the last chapter, I told you that I am going to show you how to parse and interpret code blocks, while loops and if-else constructs.

First we are going to learn how to parse code blocks in this chapter.

Lets write a sample script first. We are going to have our interpreter run it:

println 1
wait 2000
println 2

This simple script consists of just three commands:

  • println 1
  • wait 2000
  • println 2

Let’s refer to these commands as “statements.”

Now, I want you to recall the last section. We learned and demonstrated how to put expressions into a linked list and evaluate those expressions by traversing the linked list.

Let me ask, what is the difference between statements and expressions? We are able to store both of them in linked lists, and call an evaluation function on each element. In the end, the linked list does not have to know what we store inside each node, be they expressions or statements. It only needs to know the object type of the expressions and the statements, nothing more.

So, if we use the same container object for statements as we used for expressions then we can store statements in the same linked list.

And once the statements are in the linked list, evaluating them follows the same pattern we used for expressions. We will traverse the linked list and invoke the eval method of the statement objects.

This is the logic behind our interpreter. The only remaining part is how we will parse statements and store them in the linked list. Another term we will use for this linked list that we will use to store statements is Abstract Syntax Tree.

We learned how to parse expressions from the previous chapter. We will use same techniques and structure to parse statements. We will use the match and eat technique again and create node objects for the statements before finally putting them into our Abstract Syntax Tree.

Writing out the grammar we will model can help to plan how we will parse our language. Here is the grammar for a statement:

<statement> ::= <print> | <println> | <wait> 

For now, we only have three statements: print, println and wait.

Let’s get the demo program from the previous section. I’ve reproduced it here:

    public static void main(String args[])
    {
        Tokenizer tokenizer = new Tokenizer();        
        Calculator calc = new Calculator();
        
        List<String> expresssionList = new LinkedList<String>();
        
        // Remember, we use a trailing white space as a terminating character!
        expresssionList.add("(100*2+2)*2+5>=500 ");
        expresssionList.add("((5+1)*100-2+3) "); 
        expresssionList.add("100-30/2+13>=10 ");
        expresssionList.add("(853+92*5)*10-20/2+771 ");
        expresssionList.add("(5)*2 ");
        
        List<Node> commandList = new LinkedList<Node>();
        for (String expression: expresssionList)
        {
            System.out.println("Expression: "+ expression);
            calc.currentTokenPosition = 0;
            calc.tokens = tokenizer.Tokenize(expression);
            Node node = calc.BooleanExpression();
            if (node != null)
            {
                commandList.add(node);
            }            
        }     

        System.out.println("");
        System.out.println("Now, lets calculate expressions...");
        for (Node command: commandList)
            System.out.println("Expression Result: " + command.eval());
    }

In this program, there are a few expressions. We first put their node objects into a linked list. In this first linked list, there are node objects. They are not evaluated, they just are objects in the Abstract Syntax Tree.

Once we get that node objects into the linked list named commandList, we traverse commandList, get the node objects one by one, and execute their eval methods. When the traversal ends, all expressions will have been evaluated.

We will do same thing for statements. To do that, we need a way to encapsulate statements in node objects as we did for expressions so that we can put them into Abstract Syntax Tree.

Here is the Node subclasses for the print, println and wait statements:

PrintNode:

public class PrintNode extends Node
{
    public Node expression;
    public String type;
    
    public PrintNode() {}
    
    public PrintNode(Node expression, String type)
    {
        this.expression = expression;
        this.type = type;
    }
    
    public Object eval()
    {
        Object writee = expression.eval();
        
        if (type.equals("sameline")) System.out.print(writee);
        else if (type.equals("newline")) System.out.println(writee);
        
        return writee;
    }    
}

WaitNode:

public class WaitNode extends Node
{
    public Node interval;
            
    public WaitNode() {}
    
    public WaitNode(Node interval)
    {
        this.interval = interval;
    }
    
    public Object eval()
    {
        Integer waitAmount = (Integer) interval.eval();
        try
        {
            Thread.sleep(waitAmount.intValue());
        }
        catch (Exception e)
        {
            System.out.println("Error in WaitNode.eval() method" + e);
        }
        return waitAmount;
    }        
}

Please note that the PrintNode class is for both print and println statements. We use it for both by using a second parameter in its constructor. If the “type” parameter is “sameline”, it is a print statement, if the type is “newline” then it is a println statement. We will use only println in this section but print statement will be necessary for upcoming sections.

These classes adhere to our standard of being able to stored in the Abstract Syntax Tree like expression nodes:

  • They are Node subclasses
  • They have eval methods

I want to demonstrate their usage now. Please take the below example and save it into the StatementsDemo.java:

import java.util.List;
import java.util.LinkedList;

public class StatementsDemo
{
    public static void main(String args[])
    {
        Node firstMsg = new PrintNode(new NumberNode(1), "newline");
        Node secondMsg = new PrintNode(new NumberNode(2), "newline");
        
        Node wait = new WaitNode(new NumberNode(new Integer(2000)));

        List<Node> script = new LinkedList<Node>();
        script.add(firstMsg);
        script.add(wait);
        script.add(secondMsg);
        
        for (Node statement:script)
            statement.eval();
    }
}

Compile and run this class. What do you see? It writes 1 and waits two seconds and then writes 2.

What it does is execute our three statements:

  • A print statement, its parameter is 1
  • A wait statement, its parameter is 2000
  • A print statement, its parameter is 2

Encapsulating the statements in the Node type classes is really useful. This way, we can put these statements into the Abstract Syntaxt Tree.

But we need something more to be able to parse them. At the moment, as you see in above example, we put the statements into the Abstract Syntax Tree manually, but interpreters are for automating that process. We can’t know all scripts ahead of time, parse them in our minds and then create statement objects one by one manually. That’s what interpreters are for.

Our scripts need to be read from a file and fed into the interpreter.

Here is our script again:

println 1
wait 2000
println 2

Save this script into a file named script.con

To parse this script, we need a method like Expression, ArithmeticExpression, BooleanExpression, Term, Factor, etc. that we used for parsing expressions.

Before writing that method, I want to do some tidying up first. So far in the book, we’ve used Calculator.java that was designed to calculate arithmetic and boolean expressions. All parsing methods are presently in Calculator.java.

I think it’s time we changed that from a calculator to an actual parser that we can use more generally.

Time to back up all source files from the previous section and rename Calculator.java to Parser.java. We are going to add all parsing methods to Parser.java moving forward. Calculator.java will no longer be referenced.

After renaming it to the Parser.java, lets add a constructor and a method for getting tokens of the parser.

Here is the constructor methods:

    public Parser() {}    
    public Parser(List<Token> tokens)
    {
        this.tokens = tokens;
    }

And here is the getTokens() method:

    public List<Token> getTokens()
    {
        return tokens;
    }

Please add them to the Parser.java at the top after the currentTokenPosition and tokens variables declarations.

After this tidying up, let’s write a new method named Statement():

    public Node Statement()
    {
        Node node = null;        
        TokenType type = CurrentToken().type;

        if (type == TokenType.PRINT)
        {
            MatchAndEat(TokenType.PRINT);
            node = new PrintNode(Expression(), "sameline");
        }
        else if (type == TokenType.PRINTLN)
        {   
            MatchAndEat(TokenType.PRINTLN);
            node = new PrintNode(Expression(), "newline");
        }
        else if (type == TokenType.WAIT)
        {
            MatchAndEat(TokenType.WAIT);
            node = new WaitNode(Expression());
        }
        else
        {        
            Util.Writeln("Unknown language construct: " 
                         + CurrentToken().text);
            System.exit(0);
        }
        return node;       
    }

Add this method to Parser.java.

I am going to explain how this method works soon, but first I want to run the simple script from the beginning of this lesson using this method to show it works!

We need a method to read the source from file, then give the file to the tokenizer. After getting the tokens from the tokenizer, our parser will take over and parse statements one by one. It will put node objects in the Abstract Syntax Tree. The rest is just evaluating the statements in the Abstract Syntax Tree as we’ve done before.

I have collected all of these steps into a new class: Interpreter.java:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.FileNotFoundException;
import java.nio.charset.Charset;
import java.util.List;
import java.util.LinkedList;

public class Interpreter
{
    public static void main(String args[])
    {                
        boolean debug = false;
        
        if (args.length < 1) { Util.Writeln("Usage: Demo <script>"); return; }
        if (args.length > 1) { if (args[1].equals("debug")) debug = true; }        
        
        Interpreter interpreter = new Interpreter();
            
        String sourceCode = interpreter.ReadFile(args[0]);
        interpreter.Interpret(sourceCode, debug);
    }

    public void Interpret(String source, boolean debug)
    {        
        Tokenizer tokenizer = new Tokenizer();
        Parser parser = new Parser(tokenizer.Tokenize(source));
        
        if (debug) DumpTokens(parser);

        List<Node> script = new LinkedList<Node>();
        script.add(parser.Statement());
        script.add(parser.Statement());
        script.add(parser.Statement());
        
        for (Node statement:script)
            statement.eval();        
    }

    public void DumpTokens(Parser parser)
    {
        for (Token token: parser.getTokens())
            Util.Writeln("Type: " + token.type + " Text: " + token.text+" ");
        Util.Writeln();    
    }
        
    private String ReadFile(String path)
    {
        FileInputStream stream = null;
        InputStreamReader input = null;
        try
        {
            stream = new FileInputStream(path);                       
            input = new InputStreamReader(stream, Charset.defaultCharset());
                
            Reader reader = new BufferedReader(input);
            StringBuilder builder = new StringBuilder();
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer, 0, buffer.length)) > 0)
            {
                builder.append(buffer, 0, read);
            }
            //Keep the space in the end
            builder.append(" ");
            return builder.toString();
        }
        catch (FileNotFoundException e)
        {
            String errMsg = "FILE NOT FOUND. ";
            String sourceInfo = "Error in Interpreter.java->" 
                                + "ReadFile(String path) method. ";
            Util.Writeln(sourceInfo + errMsg);
            System.exit(0);
        }
        catch (IOException e)
        {
            String errMsg = "Error while reading the script. ";
            String sourceInfo = "Error in Interpreter.java->" 
                                + "ReadFile(String path) method. ";
            Util.Writeln(sourceInfo + errMsg);
            System.exit(0);
        }
        catch (Exception e)
        {
            String errMsg = "Error while reading the script. ";
            String sourceInfo = "Error in Interpreter.java->" 
                                + "ReadFile(String path) method. ";
            Util.Writeln(sourceInfo + errMsg + e);
            System.exit(0);
        }        
        finally
        {
            try
            {
                input.close();
                stream.close();
            }
            catch (Exception e)
            {
                String errMsg = "Error while closing a stream or a stream reader. ";
                String sourceInfo = "Error in Interpreter.java->" 
                                    + "ReadFile(String path) method. ";
                Util.Writeln(sourceInfo + errMsg + e);
                System.exit(0);
            }            
        }
                
        return null;
    }    
}

The Interpreter class has three methods apart from the main method:

  • ReadFile
  • Interpret
  • DumpTokens

ReadFile reads the script source code and returns the source code as a string.The Interpret method creates a new Tokenizer instance and new Parser instance with the script tokens in the constructor.

These parts tokenize the source and creates a Parser object with those tokens.

Then the second part of the Interpret method takes over:

        if (debug) DumpTokens(parser);
        
        List<Node> script = new LinkedList<Node>();
        script.add(parser.Statement());
        script.add(parser.Statement());
        script.add(parser.Statement());
        
        for (Node statement:script)
            statement.eval();        

It calls the DumpTokens if the debug parameter is true. Debug will be true if we add the “debug” keyword to the command line parameters while we are running “java Interpeter” command to run it.

The DumpTokens method is for debugging. It prints all the tokens in the script.

After that, we call the Statement() method three times since we have three different statements. The Statement() method parses the current statement at every call and returns a node object for that particular statement. They get added to Abstract Syntax Tree named script.

With that, we have completed the parsing part.

It is not enough to just run scripts. There must be an update in the tokenizer as well since we have new token types: PRINT, PRINTLN and WAIT and KEYWORD.

We need to add them to TokenType.java; here its new version:

public enum TokenType
{
    NUMBER, NEWLINE, OPERATOR, EOF, UNKNOWN,
    ADD, SUBTRACT, MULTIPLY, DIVIDE, LEFT_PAREN, 
    RIGHT_PAREN, LESS, GREATER, EQUAL, NOTEQUAL,
    LESSEQUAL, GREATEREQUAL, ASSIGNMENT, NOT, OR, AND,
    KEYWORD, PRINT, PRINTLN, WAIT, END
}

MAIN, END, PRINT, PRINTLN, WAIT explains themselves. We use the KEYWORD type for a new state in the tokenizer. Lets remember the states our tokenizer can be in so far:

  • DEFAULT
  • OPERATOR
  • NUMBER

And now, we have a new one: KEYWORD

Let’s add it to TokenizeState first; here is the new version of the TokenizeState enum:

enum TokenizeState 
{
    DEFAULT, OPERATOR, NUMBER, KEYWORD
}

Let me explain the KEYWORD tokenizer state type. When the tokenizer sees a token other than a number or operator in the source code of our scripts, it will treat it as though it is a language keyword. It needs to continue in a new state to get whole of that keyword. Here is the control we will add just below the parenthesis control in The Tokenizer.java:

                    else if (Character.isLetter(chr))
                    {
                        tokenText += chr;
                        state = TokenizeState.KEYWORD;
                    }

It checks if chr is a letter. If so, it adds it to the tokenText variable and get the tokenizer to the KEYWORD state. So, we need a case for KEYWORD state, here it is:

                case KEYWORD:
                        if (Character.isLetterOrDigit(chr))
                        {
                            tokenText += chr;
                        }
                        else
                        {
                            TokenType type = FindStatementType(tokenText);
                            tokens.add(new Token(tokenText, type));
                            tokenText = "";
                            state = TokenizeState.DEFAULT;
                            index--;
                        }
                break;                

You can add it below NUMBER state case. Here in this state case, we have used a new method: FindStatementType(String str) It finds the token type of given string:

    public TokenType FindStatementType(String str)
    {
        TokenType type = TokenType.UNKNOWN;        
        switch(str)
        {
            case "print":
                type = TokenType.PRINT;
            break;
            case "println":
                type = TokenType.PRINTLN;
            break;            
            case "wait":
                type = TokenType.WAIT;
            break;            
            default:
                type = TokenType.KEYWORD;
        }
        return type;
    }

Do these changes in the Tokenizer.java and save it. Now, we are ready to run our test script. Compile all files with this command:

javac *.java command and then run this command:

java Interpreter script.con

Here is the output:

1
2

It writes 1 and waits two seconds and writes 2. Same as the earlier output with the StatementsDemo class.

We have not added these statements in the script manually, though. We have just given the source code to the parser. It took care of the rest by parsing statements so that the interpreter could evaluate them.

I can explain the Statement() method now, recall:

    public Node Statement()
    {
        Node node = null;        
        TokenType type = CurrentToken().type;

        if (type == TokenType.PRINT)
        {
            MatchAndEat(TokenType.PRINT);
            node = new PrintNode(Expression(), "sameline");
        }
        if (type == TokenType.PRINTLN)
        {   
            MatchAndEat(TokenType.PRINTLN);
            node = new PrintNode(Expression(), "newline");
        }
        else if (type == TokenType.WAIT)
        {
            MatchAndEat(TokenType.WAIT);
            node = new WaitNode(Expression());
        }
        else
        {        
            Util.Writeln("Unknown language construct: " 
                         + CurrentToken().text);
            System.exit(0);
        }
        return node;       
    }

Here is what it does:

  • It gets the current token and checks if it matches any of the statements
  • It matches-eats print keyword if the statement is print and create a PrintNode object. Its parameter is obtained by calling the Expression() method. You may remember that we parse expressions with the Expression() method, and a number is an expression
  • It matches-eats the println keyword if the statement is println and creates a PrintNode object. Its parameter is obtained by calling the Expression() method. You will remember that we parsed expressions with the Expression() method, and a number is an expression
  • It matches-eats wait keyword if the statement is wait and create a WaitNode object. Its parameter is taken by calling Expression() method. You will remember that we parse expressions with Expression() method and a number is an expression
  • And finally, if the current token does not match any statement keyword, it exits the program
  • If none of these cases are triggered, it is necessary to exit beacuse the keyword is unknown.

So far so good. We can parse statements and have them interpreted, but there is a slight problem. Let me show what it is:

    public void Interpret(String source, boolean debug)
    {        
        Tokenizer tokenizer = new Tokenizer();
        Parser parser = new Parser(tokenizer.Tokenize(source));
        
        if (debug) DumpTokens(parser);

        List<Node> script = new LinkedList<Node>();
        script.add(parser.Statement());
        script.add(parser.Statement());
        script.add(parser.Statement());
        
        for (Node statement:script)
            statement.eval();        
    }

This is the Interpret method of the Interpreter.java. The part we parse the statements is:

        script.add(parser.Statement());
        script.add(parser.Statement());
        script.add(parser.Statement());

It only parses three different statements. Since not all the scripts have three statements, we need to adjust that. They can have arbitrary amount of statements.

The solution is, we can get all statements at once using a loop. Lets add this method to Parser.java:

    public List Block()
    {   
        List<Node> statements = new LinkedList<Node>();        
        while ( CurrentToken().type != TokenType.END)
        {                       
            statements.add(Statement());
        }                        
        MatchAndEat(TokenType.END);               
        return statements;
    }

When we call this method, it starts a loop and continue it till the current token becomes “END”. We use END keyword to understand that we reach the end of the script. This way, we can put any amount of statements before the END keyword, we are not limited to three statements or something like this.

This way, we automatize statement parsing. It calls the Statement() method in the while loop and so in every step of the loop, a statement gets parsed.

Let’s perform the necessary updates in the tokenizer for our next new keyword END. First, add END keyword to TokenType.java. Then update the FindStatementType() method in the Tokenizer.java, here its new version:

    public TokenType FindStatementType(String str)
    {
        TokenType type = TokenType.UNKNOWN;        
        switch(str)
        {
            case "end":
                type = TokenType.END;
            break;        
            case "print":
                type = TokenType.PRINT;
            break;
            case "println":
                type = TokenType.PRINTLN;
            break;            
            case "wait":
                type = TokenType.WAIT;
            break;            
            default:
                type = TokenType.KEYWORD;
        }
        return type;
    }

We can update the Interpret method of Interpreter.java now:

    public void Interpret(String source, boolean debug)
    {        
        Tokenizer tokenizer = new Tokenizer();
        Parser parser = new Parser(tokenizer.Tokenize(source));
        
        if (debug) DumpTokens(parser);

        List<Node> script = parser.Block();
        
        for (Node statement:script)
            statement.eval();        
    }

We need to update the script.con as well since we need to put the END keyword at the end of every script:

println 1
wait 2000
println 2
end

Compile all files with javac *.java command and run this command:

java Interpreter script.con

Here is the output:

1
2

It is the same output as above. But there is a huge difference in terms of techniques between the two implementations.

We are able to parse arbitrary amount of statements, this is a huge achievement.

We can refer to this “arbitrary amount of statements” as a “statement block” or “code block”. Since code blocks, is more common in programming languages, we’ll use that. Here is the grammar for code blocks:

<block> ::= ( <statement> )+

It reflects to the Block() method we have written above. In a code block, there are one or more statements.

Before closing this chapter, I want to add a bit of syntactic sugar to our parser. We have added END keyword to our parser to be able to understand that we are at the end of the script. There are statements before the END keyword. But it seems a bit unbalanced. It is good idea to put a starter keyword to the beginning of all scripts:

script
    println 1
    wait 2000
    println 2
end

It is more tidy. Save the script.con this way that there is script keyword at the beginning.

As we have done for END keyword, we will update the tokenizer for script keyword. Add SCRIPT token type to TokenType.java, here its new version:

public enum TokenType
{
    NUMBER, NEWLINE, OPERATOR, EOF, UNKNOWN,
    ADD, SUBTRACT, MULTIPLY, DIVIDE, LEFT_PAREN, 
    RIGHT_PAREN, LESS, GREATER, EQUAL, NOTEQUAL,
    LESSEQUAL, GREATEREQUAL, ASSIGNMENT, NOT, OR, AND,
    KEYWORD, PRINT, PRINTLN, WAIT, END, SCRIPT
}

And then update the FindStatementType() method in the Tokenizer.java, by adding

            case "script":
                type = TokenType.END;
            break;                

case. Here its new version:

    public TokenType FindStatementType(String str)
    {
        TokenType type = TokenType.UNKNOWN;        
        switch(str)
        {
            case "script":
                type = TokenType.SCRIPT;
            break;                
            case "end":
                type = TokenType.END;
            break;        
            case "print":
                type = TokenType.PRINT;
            break;
            case "println":
                type = TokenType.PRINTLN;
            break;            
            case "wait":
                type = TokenType.WAIT;
            break;            
            default:
                type = TokenType.KEYWORD;
        }
        return type;
    }

Finally, we need to update the Interpret() method in Interpreter.java. We also need to match and eat the script keyword before calling the Block() method:

Here is the new version of the Interprer() method:

    public void Interpret(String source, boolean debug)
    {        
        Tokenizer tokenizer = new Tokenizer();
        Parser parser = new Parser(tokenizer.Tokenize(source));
        
        if (debug) DumpTokens(parser);

        parser.MatchAndEat(TokenType.SCRIPT);
        List<Node> script = parser.Block();
        
        for (Node statement:script)
            statement.eval();        
    }

Compile and run the script.con as we have done above. Here is the output:

1
2

It is the same output as before.

We have got the script keyword at the beginning and the end keyword at the end of all scripts as syntactic sugar. Later, you’ll see why this can be so useful for visualizing code blocks.

As of this chapter, we are able to parse arbitrary amount of statements, referred to as code blocks. We will use it in the upcoming chapters to parse while loops body, if-else cases and more! Keep on reading!

While Loop

We have learned how to parse code blocks and variables in the previous sections.

Code blocks are necessary for parsing the body section of while loops. Variables are necessary because they are used in the conditional section of while loops.

Lets write a simple script first:

script
    count = 1
    while count <= 5
        println "Hello World!"
        wait 500
        count = count + 1
    end
end

Save it as while-demo.con.

The first question I expect you to ask is How can I evaluate println, wait and the count = count + 1 statements in the while loop body? We can’t evaluate them the way we have done in the previous sections. Since they don’t belong to whole script, they get evaluated only if the while condition evaluates to true.

Additionally, the code within the body of the while loop should have no impact on other parts of the script. We need to ensure that, otherwise very odd behavior can manifest without warning.

The key thing here is, the block belongs to the while loop.

This type of code blocks must be treated as like separate scripts inside the main script. They have an owner like a while loop.

Lets remember what we used to evaluate a whole script in the previous section:

  • We parse statements using the Block() method that parses and puts node objects into a linked list
  • We traverse this linked list and call the eval method of the statements in each step of the traversing loop

But this time, we can not put the statements into our list. Because while loop is yet another statement as well and it has two parts:

  • Condition
  • Body

It is not like print, println and wait statements because its body is a code block itself. To be able to handle a while loop with its condition and body, we need to encapsulate both condition and body.

For encapsulating a while loop body, we need a new class for encapsulating code blocks. We can use the BlockNode class for that:

import java.util.List;
import java.util.LinkedList;

public class BlockNode extends Node
{
    private List<Node> statements;

    public BlockNode(List<Node> statements)
    {    
        this.statements = statements;
    }
   
    public Object eval()
    {
        Object ret = null;
        
        for (Node statement : statements)
        {
            ret = statement.eval();
        }
        return ret;
    }

    public Node get(int index)
    {
        return statements.get(index);
    }

    protected List<Node> getStatements()
    {
        return statements;
    }
    
    public String toString()
    {
        String str = "";
        for (Node statement: statements)
            str = str + statement + "\n";
        
        return str;       
    }
}

Its member variable statements is for storing the statements that belong to a particular code block. Once we get them in this variable, we can evaluate them by calling the eval() method of this class on that code block object.

Remember, we use Block() method for parsing code blocks. It returns linked list at the moment. We need to update it so that it returns a BlockNode object. Here its new version:

    public BlockNode Block()
    {   
        List<Node> statements = new LinkedList<Node>();        
        while ( CurrentToken().type != TokenType.END)
        {                       
            statements.add(Statement());
        }                        
        MatchAndEat(TokenType.END);               
        return new BlockNode(statements);
    }

Add this method to Parser.java.

This method parses any code block and returns a BlockNode object.

Lets update the Interpret method of Interpreter.java to use new version of the Block() method. Here is the new version:

    public void Interpret(String source, boolean debug)
    {        
        Tokenizer tokenizer = new Tokenizer();
        Parser parser = new Parser(tokenizer.Tokenize(source));
        
        if (debug) DumpTokens(parser);
        
        parser.MatchAndEat(TokenType.SCRIPT);
        Node script = parser.Block();
        
        script.eval();
    }

The difference is, we do not need this loop below:

        for (Node statement:script)
            statement.eval();        

to traverse the Abstract Syntax Tree. We removed this code snippet and put script.eval() now that we have a BlockNode object that encapsulates all statements of the script. The only thing to do next is just call the eval() method on this object. It evaluates all statements one by one by traversing the linked list that it holds, full of the scripts statements.

Now we can get back to parsing while loops.

        else if (IsWhile())
        {                           
            node = While();
        }

Add this control to Statement() method in Parser.java just below the assignment control.

The recognizer method that we use is IsWhile(), add this method to Parser.java as well.

    public boolean IsWhile()
    {
        return CurrentToken().type == TokenType.WHILE;
    }

The While() method we used above in the control is for parsing the whole while loop:

    public Node While()
    {
        Node condition, body;        
        MatchAndEat(TokenType.WHILE);
        condition = Expression();
        body = Block();        
        return new WhileNode(condition, body);
    }

It first matches and eats the while keyword. Then it gets the while loop condition by calling the Expression() method that it does parse the boolean expressions. A note there: while loop conditions are a boolean expression.

Then, it calls the Block() method to parse while loop body. Our Block() method is able to parse any code block and get that code block statements into a BlockNode object. So, we use it for while loop body as well.

Then, finally, we need to put this condition and body into an object so that we can store it in the Abstract Syntax Tree.

An interesting takeaway from this technique is that we are creating scope for any object that holds a Block() call. So not only does the entire script get contained in a Block, but while loop bodies are contained in Blocks as well. If you’re thinking forward, then you may also be considering using Blocks for if-else body sections as well.

It is a very good example to the structure of the programming languages that they are nested. A while loop can include an if case and then that particular if case can include a while loop. All of their bodies are code blocks and we use Block method to parse that code blocks.

Here is the WileNode class that we use for encapsulating the while loops:

import java.util.List;
import java.util.LinkedList;

public class WhileNode extends Node
{
    public Node condition;
    public Node body;
    
    public WhileNode() {}
    
    public WhileNode(Node condition, Node body)
    {
        this.condition = condition;
        this.body = body;
    }
    
    public Object eval()
    {
        Object ret = null;        
        while ( ((Boolean) condition.eval()).booleanValue() )
        {            
            ret = body.eval();
        }                       
        return ret;
    }    
}

We need this class to put WhileNode objects to the Abstract Syntax Tree.

The last part we need to update is tokenizer. Because we have a new keyword: while.

Add WHILE token type to TokenType.java. Then update the FindStatementType() method in the Tokenizer.java. Here its new version:

    public TokenType FindStatementType(String str)
    {
        TokenType type = TokenType.UNKNOWN;        
        switch(str)
        {
            case "script":
                type = TokenType.SCRIPT;
            break;                
            case "end":
                type = TokenType.END;
            break;        
            case "while":
                type = TokenType.WHILE;
            break;            
            case "print":
                type = TokenType.PRINT;
            break;
            case "println":
                type = TokenType.PRINTLN;
            break;            
            case "wait":
                type = TokenType.WAIT;
            break;            
            default:
                type = TokenType.KEYWORD;
        }
        return type;
    }

Now, it is able to recognize the while keyword. Compile all files and run this command:

java Interpreter while-demo.con

Here is the output:

Hello World!
Hello World!
Hello World!
Hello World!
Hello World!

Congratulations. You can loop!

script count = 1 while count <= 5 println “Hello World!” wait 500 count = count + 1 end end

The script starts with assingning 1 to count variable. Then it loops in the while loop till the count variable gets to 6 that we increase it in each step of the loop. It prints Hello World! in each step of the loop and waits 0.5 seconds after printing.

Please take a second to notice that count’s final value is 6 because we used <= instead of strictly <.

We have completed parsing and interpreting while loops. Now it is time to add if-else cases to our interpreter. Keep on reading!

Arrays

We have implemented variable parsing in the last section. Arrays function similarly to variables.

The difference is that they hold multiple values, instead of just one. The different values can be accessed by their index in the array.

Parsing arrays is based on variables. We can say, once you add index support to variables, they become arrays.

As usual, I am going to start by defining the structure of what we are parsing.

Arrays will have two functionalities, definition and access

Example array definitions:

fruits = ["Apple", "Orange"]
orders = [1234, 1235]

The first one is an array of strings, the second one is an array of numbers.

As stated earlier, we access elements of an array by their index:

myFavoriteFruit = fruits[0]
fruits[1] = "Banana"
yourlastOrder = orders[index]
println orders[index-1]
updateRecord(recordList[5])

By looking at the structutures above, we are able to conclude that we will need to support two kinds of parsing for arrays:

  • Parsing array definition
  • Parsing array access

And here is the grammar for array definition:

<array> ::= (<left_bracket> (<number> | <string> | <variable>) 
     		[<comma> (<number> | <string> | <variable>)]* <right_bracket>)

With our grammar defined, we can get started parsing array defintion.

Array definition is really just a form of assignment, so we will handle it in the assignment section of the parser. Remember that we are handling variable assignments in the Statement() method using below control:

        if (IsAssignment())
        {
            node = Assignment();
        }

We are going to update it to allow it to parse an array definition as well. What we need to do is put an appropriate code snippet into the Assignment() method. Here is the new version of the Assignment() method, remove the previous one from Parser.java and add this version:

    public Node Assignment()
    {
        Node node = null;
        String name = MatchAndEat(TokenType.KEYWORD).text;
        MatchAndEat(TokenType.ASSIGNMENT);
                    
        if (CurrentToken().type == TokenType.LEFT_BRACKET)
        {                
            node = ArrayDefinition(name);
        }
        else
        {
            Node value = Expression();
            node = new AssignmentNode(name, value, this);
        }
        return node;
    }

What we added is an extra control to understand if the assignment is an array definition. Here is what we added:

        if (CurrentToken().type == TokenType.LEFT_BRACKET)
        {                
            node = ArrayDefinition(name);
        }

It checks if there is left bracket after the equals sign, and if so, it creates the array definition by calling the ArrayDefinition() method. If it does not find a left bracket after the equals sign, we know that we’re dealing with a variable.

Here it is the ArrayDefinition method, invoked above:

    public Node ArrayDefinition(String name)
    {               
        List<Node> elements = new ArrayList<Node>();
        MatchAndEat(TokenType.LEFT_BRACKET);        
        if (CurrentToken().type != TokenType.RIGHT_BRACKET)
        {
            elements.add(Expression());
            while (CurrentToken().type != TokenType.RIGHT_BRACKET)
            {
                MatchAndEat(TokenType.COMMA);
                elements.add(Expression());
            }
        }
        MatchAndEat(TokenType.RIGHT_BRACKET);        
        return new AssignmentNode(name, new ArrayNode(elements), this);
    }                 

Add this method to Parser.java.

In this method, we eat left bracket in the beginning and get first element of the array using the Expression() method. Then we get the array elements in the loop the same way. We keep array elements in an array list to allow for dynamic expansion and constant speed index access. The loop ends when it sees a right bracket. Afterwards, we create a new AssignmentNode object for this array. It’s worth noting that every array definition is an assignment as well. In the constructor of AssignmentNode, you will notice that there is an ArrayNode object creation.

The ArrayNode object is for encapsulating array definitions. Here its source code:

import java.util.ArrayList;
import java.util.List;

public class ArrayNode extends Node
{
    private List<Node> elements;

    public ArrayNode(List<Node> elements)
    {
        this.elements = elements;
    }

    public Object eval()
    {
        List<Object> items = new ArrayList<Object>(elements.size());
        for (Node node : elements)
        {
            items.add(node.eval());
        }
        return items;
    }
}

Besides the obvious changes above, we need to also update the tokenizer since we have used new token types in our Assignment() and ArrayDefinition() methods.

Let’s add these token types to the TokenType enum: COMMA, LEFT_BRACKET and RIGHT_BRACKET.

Then, we need to update the IsParen and FindParenType methods of Tokenizer.java. Here is their new version:

    public boolean IsParen(char chr)
    {   
        boolean prntOp = chr == '(' || chr == ')';
        boolean brktOp = chr == '[' || chr == ']';
        boolean puncOp = chr == ',';
        
        return prntOp || brktOp || puncOp;        
    }

    public TokenType FindParenType(char chr)
    {
        TokenType type = TokenType.UNKNOWN;
        switch(chr)
        {
            case '(':
                type = TokenType.LEFT_PAREN;
            break;
            case ')':
                type = TokenType.RIGHT_PAREN;
            break;        
            case '[':
                type = TokenType.LEFT_BRACKET;
            break;
            case ']':
                type = TokenType.RIGHT_BRACKET;
            break;
            case ',':
                type = TokenType.COMMA;
            break;                                                           
        }
        return type;
    }

Don’t forget to delete previous versions and replace them with these new ones in the Tokenizer.java rather than simply adding them to the file. After these updates, our tokenizer is able to recognize brackets.

Also we need two new methods to recognize the COMMA token type:

The first one is IsPunc() to recognize if the token is a comma.

    public boolean IsPunc(char chr)
    {   
        boolean puncOp = chr == ',';
        return puncOp;        
    }

Second, FindPuncType() to find out the token type if it is a comma:

    public TokenType FindPuncType(char firstOperator)
    {
        TokenType type = TokenType.UNKNOWN;
        switch(firstOperator)
        {
            case ',':
                type = TokenType.COMMA;
            break;
        }
        return type;
    }

Add these methods to Tokenizer.java.

We only support one punctuation type at the moment. But we designed these methods thinking that we can eventually expand our tokenizer to recognize additional punctuation types like dot, colon, semicolon, etc.

After adding these two methods, we need to update the Tokenize() method of the Tokenizer.java to make it able to tokenize punctuation types. Add below control to Tokenize() method in the Tokenizer.java:

else if (IsPunc(chr))
{
    TokenType puncType = FindPuncType(chr);
    tokens.add(new Token(Character.toString(chr), puncType));
}                                                                    

It must be in the DEFAULT case of the switch control for parsing..

Before compiling all of our files with the new additions, remember to add import java.util.ArrayList to Parser.java since we have started using array lists.

Now, compile all files with the javac *.java command. There shouldn’t be any errors.

We have completed parsing array definition. We need to implement array access now. Let’s recall the different array access examples we saw in the beginning of this section:

myFavoriteFruit = fruits[0]
fruits[1] = "Banana"
yourlastOrder = orders[index]
println orders[index-1]
updateRecord(recordList[5])

Arrays are accessed using their indices. This is the first thing we need to parse. Secondly, we need to parse assignment when there is an array accessed on the left or right side of the equals sign.

As I said before, arrays are like variables, only difference is indices. So, we can treat them like variables. This gives us a clue: we are going to update the Variable() method for handling array access. Remember, we are parsing variables using the Variable() method that is called in Factor().

This clue brings us some simplicity: since the Factor() method is a part of a chain of expression parsing, once we get array access working, we can use array elements anywhere that we use expressions.

A quick example:

result = sum[0] + sum[1] + sum[2]

See that? The right side is an expression. And whole line is an assignment.

So, after having the necessary logic, it is new version of the Variable() method:

    public Node Variable()
    {
        Token token = MatchAndEat(TokenType.KEYWORD);
        Node varNode = new VariableNode(token.text, this);
                
        // Handle array access here
        if (CurrentToken().type == TokenType.LEFT_BRACKET)
        {                                
            MatchAndEat(TokenType.LEFT_BRACKET);
            Node key = Expression();
            MatchAndEat(TokenType.RIGHT_BRACKET);
            return new LookupNode((VariableNode) varNode, key);
        }               
        else return varNode;    
    }

Please delete the old version and add this new one to the Parser.java.

Let me explain how the array access works here.

We added a new control that it utilizes to understand if a given variable is an array element. If it is an array, it matchs-eats left bracket, gets the expression inside brackets and matchs-eats the right bracket. Then, it creates a new object of LookupNode.

I want to point out something important here: we get the expression between brackets, right? It is the another example of the usage of the expressions! We use expression parsing the get the expression that is given as array indice. Nice.

Now, here is the LookupNode class’ source code first, then I will explain why we need it:

import java.util.List;

public class LookupNode extends Node
{
    private VariableNode varNode;
    private Node keyNode;

    public LookupNode(VariableNode varNode, Node keyNode)
    {
        this.varNode = varNode;
        this.keyNode = keyNode;
    }
    
    public Object eval()
    {
        Object var = varNode.eval();
        int index = ((Integer) keyNode.eval()).intValue();        
        Object ret = ((List) var).get(index);        
        return ret;
    }
}

It is a class for encapsulating access to a pairing using key and value. It has two members:

  • keyNode is the key of the element we access
  • varNode is the value of the element we access

Both members are Node subclasses. When its eval method is called, it evaluates the key and the value and returns the element that is accessed with this key and value.

As you remember, variable node gets elements from the symbol table. So, the LookupNode object reads from the symbol table.

Now, let’s compile TokenType.java, Tokenizer.java and Parser.java with the new changes. There shouldn’t be any error when you compile.

Let’s make a demonstration of what we have done so far, save below script in array-access-demo.con:

script
    
    fruits = ["apple", "orange"]
    
    count = 0
    
    while count < 2
        println fruits[count]
        wait 300
        count = count + 1
    end       
    
end

You will see below output when you run this script:

apple
orange

Nice. Our script gets the array and prints its elements. We have successfuly completed the parsing array definition and array access.

What’s left? We need to implement array updates. Here is a quick example:

fruits[1] = "Banana"

It updates the fruits array by puting “Banana” into the array at index 1. Array update is a form of assignment as well, and since we know every assignment is a statement, we need to update the Statement() method.

Let’s start with putting a control into the Statement() method so we can determine if an operation is array access. Put this control in the Statement() method in Parser.java:

        else if (IsArrayAccess())
        {
            node = ArrayUpdate();
        }

So, here is the new version of the Statement() method:

    public Node Statement()
    {
        Node node = null;        
        TokenType type = CurrentToken().type;

        if (IsAssignment())
        {
            node = Assignment();
        }
        else if (IsArrayAccess())
        {
            node = ArrayUpdate();
        }        
        else if (IsWhile())
        {                           
            node = While();
        }
        else if (IsIfElse())
        {               
            node = If();
        }              
        else if (type == TokenType.PRINT)
        {
            MatchAndEat(TokenType.PRINT);
            node = new PrintNode(Expression(), "sameline");
        }
        else if (type == TokenType.PRINTLN)
        {   
            MatchAndEat(TokenType.PRINTLN);
            node = new PrintNode(Expression(), "newline");
        }
        else if (type == TokenType.WAIT)
        {
            MatchAndEat(TokenType.WAIT);
            node = new WaitNode(Expression());
        }
        else
        {        
            Util.Writeln("Unknown language construct: " 
                         + CurrentToken().text + CurrentToken().type);
            System.exit(0);
        }
        return node;       
    }

There is a new recognizer method here, IsArrayAccess:

    public boolean IsArrayAccess()
    {
        TokenType type = CurrentToken().type;
        return type == TokenType.KEYWORD &&
                        NextToken().type == TokenType.LEFT_BRACKET;
    }

Don’t forget to add it to Parser.java as well. It’s for checking if the operation is an attempt to access an array, like:

fruits[1] = "grape"

You see that there is a call to the ArrayUpdate() method:

    public Node ArrayUpdate()
    {
        String arrayName = MatchAndEat(TokenType.KEYWORD).text;
        Node array = new VariableNode(arrayName, this);
        
        MatchAndEat(TokenType.LEFT_BRACKET);
        Node indexExpr = Expression();
        MatchAndEat(TokenType.RIGHT_BRACKET);
        
        MatchAndEat(TokenType.ASSIGNMENT);
        Node rightSideExpr = Expression();
        
        return new ArrayUpdateNode(array, indexExpr, rightSideExpr);
    }    

This method handles assignments like this:

myArray[index] = yourArray[yourIndex]

First we get the array name, then the expression for the left index. The index itself is an expression that might be literal number, or an expression, like 1+2, or a variable just as it is in the above example.

Then, it matches and eats the “=” assignment operator. Afterwards, it gets the right side. This time, the Expression() method handles it because we added the facility for Expression to take care of array access.

Finally, at the end of this method, we create a new ArrayUpdateNode object. The ArrayUpdateNode object is for encapsulating array updates. The array element gets updated when the interpreter executes the ArrayUpdateNode object’s eval method.

Here is the ArrayUpdateNode class:

import java.util.List;

public class ArrayUpdateNode extends Node
{
    private Node array;
    private Node indexExpression;
    private Node rightSideExpression;
    
    public ArrayUpdateNode(Node array, Node indexExpression,
                            Node rightSideExpression)
    {
        this.array = array;
        this.indexExpression = indexExpression;
        this.rightSideExpression = rightSideExpression;
    }
    
    public int ToInt(Node node)
    {
        Object res = node.eval();
        return ((Integer) res).intValue();
    }
    
    public Object eval()
    {        
        Object arrayVariable = array.eval();
        int index = ToInt(indexExpression);
        Object newValue = rightSideExpression.eval();
        
        Object ret = ((List) arrayVariable).set(index, newValue);
        
        return ret;
    }
}

Save and compile all the classes with the new additions and updates.

Let’s write a new script to demo the array update functionality. Try the below script and save it to file array-update-demo.con:

script
    
    fruits = ["apple", "orange"]
    
    println "Fruits BEFORE Update"
    println "------------"
    count = 0    
    while count < 2
        println fruits[count]
        wait 300
        count = count + 1
    end
    
    # Lets leave a space here
    println ""
    
    grp = "Grape"
    sberry = "Strawbery"
    fruits[0] = grp
    fruits[1] = sberry

    println "Fruits AFTER Update"
    println "------------"
    count = 0    
    while count < 2
        println fruits[count]
        wait 300
        count = count + 1
    end    
    
end

Here is the output when we run it:

Fruits BEFORE Update
------------
apple
orange

Fruits AFTER Update
------------
Grape
Strawbery

With this demo script, we have demonstrated:

  • Array definition
  • Array access
  • Array update
  • Variables
  • And even comments!

Things are becoming pretty exciting! Good.

We have arrays functioning in our interpreter now.

In the next chapter, things will be even more exciting once we are going to implement functions. Keep on reading!