During this course you will implement a compiler for the Decaf programming language. Decaf is an adaptation of JavaTM. (The description of Decaf presented here is actually an adaptation of the original Decaf language specification devised by Brad Vander Zanden.) Decaf presents a number of learning opportunities:
Student compilers will be written in the C++ programming language and will take advantage of the Lex (Flex) and Yacc (Bison) compiler tools. Your compiler must be able to be built with GNU C++ (G++) and run on the cpp.cs.southern.edu programming host (running Linux). You are free to develop on another platform, but you should check to make sure it works on cpp before submitting it.
Your compiler will generate MIPS assembly code that will run on the SPIM simulator.
If you have your own PC, all the tools necessary for writing the Decaf compiler are available for free (Linux, GNU C++, SPIM).
Decaf is a strongly typed language that supports the following features:
In order to make the compiler construction process tractable for a semester course, many important features of Java are omitted from Decaf. These include inheritance, exceptions, interfaces, static members, constructors, multidimensional arrays, threads, other conditional and iteration statements (like switch and for), access modifiers (public, private, protected), garbage collection, and package (library) support.
Here is a sample Decaf program that computes the greatest common divisor of two numbers entered by the user (the gcd method implements Euclid's algorithm):
class Example { void main() { int x; int y; x = read(); y = read(); print(gcd(x, y)); } int gcd(int a, int b) { if ( b == 0 ) return a; else return gcd(b, a % b); } }
The following Decaf program creates a linked list:
class ListNode { int data; ListNode next; void setData(int x) { data = x; } } class List { ListNode head; void show(ListNode list) { ListNode cursor; int count; count = 0; cursor = list; while ( cursor != null ) { count = count + 1; print(count, cursor.data); cursor = cursor.next; } } void main() { ListNode second; ListNode third; head = new ListNode(); head.setData(4); // Could use "head.data = 4;" print(head.data); second = new ListNode(); second.setData(5); print(second.data); third = new ListNode(); third.setData(11); print(third.data); head.next = second; second.next = third; third.next = null; show(head); } }
The following notational conventions are used throughout this document:
1. |
PROGRAM |
--> |
CLASS_DECLARATION+ |
2. |
CLASS_DECLARATION |
--> |
class identifier CLASS_BODY |
3. |
CLASS_BODY |
--> |
{ VAR_DECLARATION* METHOD_DECLARATION* } |
4. |
VAR_DECLARATION |
--> |
TYPE identifier ; |
5. |
TYPE |
--> |
SIMPLE_TYPE |
| |
SIMPLE_TYPE[] |
||
6. |
SIMPLE_TYPE |
--> |
int |
| |
identifier |
||
7. |
METHOD_DECLARATION |
--> |
RESULT_TYPE identifier ( PARAMETER_LIST ) METHOD_BODY |
8. |
RESULT_TYPE |
--> |
TYPE |
| |
void |
||
9. |
PARAMETER_LIST |
--> |
epsilon |
| |
PARAMETER < , PARAMETER >* |
||
10. |
PARAMETER |
--> |
TYPE identifier |
11. |
METHOD_BODY |
--> |
{ LOCAL_VAR_DECLARATION* SIMPLE_STATEMENT* } |
12. |
LOCAL_VAR_DECLARATION |
--> |
TYPE identifier ; |
13. |
STATEMENT |
--> |
SIMPLE_STATEMENT |
| |
{ SIMPLE_STATEMENT+ } |
||
14. |
SIMPLE_STATEMENT |
--> |
; |
| |
NAME = EXPRESSION ; |
||
| |
NAME ( ARG_LIST ) ; |
||
| |
print ( ARG_LIST ) ; |
||
| |
CONDITIONAL_STATEMENT |
||
| |
while ( EXPRESSION ) STATEMENT |
||
| |
return OPTIONAL_EXPRESSION ; |
||
15. |
NAME |
--> |
VARIABLE_ACCESS |
|
| |
VARIABLE_ACCESS.FIELD_ACCESS |
|
16. |
VARIABLE_ACCESS |
--> |
this |
| |
identifier |
||
| |
identifier [EXPRESSION] |
||
17. |
FIELD_ACCESS |
--> |
identifier |
|
|
| |
identifier [EXPRESSION] |
18. |
ARG_LIST |
--> |
epsilon |
| |
EXPRESSION < , EXPRESSION >* |
||
19. |
CONDITIONAL_STATEMENT |
--> |
if ( EXPRESSION ) STATEMENT |
| |
if ( EXPRESSION ) STATEMENT else STATEMENT |
||
20. |
OPTIONAL_EXPRESSION |
--> |
epsilon |
| |
EXPRESSION |
||
21. |
EXPRESSION |
--> |
NAME |
| |
number |
||
| |
null |
||
| |
NAME ( ARG_LIST ) |
||
| |
read () |
||
| |
NEW_EXPRESSION |
||
| |
UNARY_OP EXPRESSION |
||
| |
EXPRESSION RELATION_OP EXPRESSION |
||
| |
EXPRESSION SUM_OP EXPRESSION |
||
| |
EXPRESSION PRODUCT_OP EXPRESSION |
||
| |
( EXPRESSION ) |
||
22. |
NEW_EXPRESSION |
--> |
new identifier () |
| |
new int [ EXPRESSION ] |
||
| |
new identifier [ EXPRESSION ] |
||
23. |
UNARY_OP |
--> |
+ |
| |
- |
||
| |
! |
||
24. |
RELATION_OP |
--> |
== |
| |
!= |
||
| |
<= |
||
| |
>= |
||
| |
< |
||
| |
> |
||
25. |
SUM_OP |
--> |
+ |
| |
- |
||
| |
|| |
||
26. |
PRODUCT_OP |
--> |
* |
|
|
| |
/ |
| |
% |
||
| |
&& |
Operator precedence is as follows (from highest to lowest):
Within each group of operators each operator has the same precedence; for example, +, -, and || have the same precedence.
All operators associate left-to-right; for example, a + b + c is interpreted as (a + b) + c.
The Decaf language observes the following lexical conventions:
letter |
--> |
[a-z] | [A-Z] |
digit |
--> |
[0-9] |
alpha |
--> |
letter | digit | _ |
identifier |
--> |
letter alpha* |
number |
--> |
digit* |
class else if int new null print read return this void while
[ ] { } != == < > <= >= && || ! + - * / % ; , ( ) = // .
Decaf has two primitive types:
Decaf has two structured types:
All array and class variables (objects) in Decaf are reference variables. When a variable is declared, space is allocated for a reference, but no space is allocated for an object. The programmer must explicitly create an object with the new operator and then assign the reference that new returns to the appropriate variable. Java works exactly the same way. When an object is no longer referenced by any variable, it becomes garbage; Java would properly collect the garbage, but Decaf simply ignores the object. This failure to reclaim storage should pose no problem for most of the Decaf programs you will write.
The following code performs the actions indicated by the comments:
class Foo { // Declare a class named Foo int key; } Foo a; // Declare a couple of variables that may Foo b; // reference objects of type Foo a = new Foo(); // Create a Foo object and assign the // reference to variable a b = a; // b references the same Foo object as a a.key = 10; print(b.key); // 10 gets printed a = new Foo(); // a and b will now reference different b = new Foo(); // objects
Parentheses appear after the variable name when a class object is created with new. Java requires this since the use of new actually calls a constructor, and arguments can be supplied if needed. Decaf does not implement constructors, but the parentheses remain so Decaf code looks more like Java code.
A class declaration creates a new class type and specifies its implementation. The identifier immediately after the class keyword defines the name of the class. The class body then defines the implementation by possibly declaring a set of instance variables and a set of methods.
2. |
CLASS_DECLARATION |
--> |
class identifier CLASS_BODY |
3. |
CLASS_BODY |
--> |
{ VAR_DECLARATION* METHOD_DECLARATION* } |
The body of a class is delimited by curly braces. There is no semicolon after the class body. Java allows variable and method declarations to be intermixed, but for simplicity Decaf requires that all instance variable declarations precede all method declarations.
Variable Declarations. Java supports multiple variable declarations (variables separated by commas) in a single declaration and variable initialization. Decaf allows only one variable to be declared at a time and omits variable initializers. Decaf also omits class variables (refered to as static in Java).
Integer instance variables should be automatically initialized to zero (0), and reference variables should be initialized to null.
Types. A valid type for a variable is either the primitive type int, a previously defined class, or an array type. The current class is considered to be a previously defined class, so the type of an instance variable may be the class being defined. For example, the following definition is valid:
class TreeNode { int key; TreeNode left_child; TreeNode right_child; }
Unlike C/C++, Java allows the square brackets of an array declaration to be placed after the type, instead of after the variable (Java also allows the C/C++ style of brackets after the variable). Decaf requires the brackets to be placed after the type. Also unlike C/C++, Java does not specify a size at the point of declaration; the size is provided when the new operator is used. Consequently, an array variable may reference different length arrays over its lifetime. Example array declarations include:
int[] a; // Declare a Foo[] f; // Declare f a = new int[10]; // Allocate array object a f = new Foo[1000]; // Allocate array object f
Identifier names. It is not an error to assign a variable the same name as a previously or subsequently defined class name (although it is poor programming style).
Method Declarations. A method is similar to a function in conventional imperative languages. A method is a block of code that can be invoked with zero or more values as arguments. Unlike a traditional function, a method belongs to a particular class, and its definition must appear within the body of a class definition.
Java supports special methods called constructors that automatically initialize an object when new is invoked. For simplicity, Decaf does not have constructors.
Java supports method overloading; that is, multiple methods with the same name but different parameter lists. For simplicity, Decaf does not support method overloading, so it is a compile time error if two methods in a class have the same name.
In Decaf, unlike Java, methods and instance variables must have unique names; that is, in Decaf it is illegal to have an instance variable and a method with the same name within the same class.
Return Types. If a method does not return a value, its return type should be declared void. If the return type is an array type, Decaf requires the empty brackets be placed after the type name. For example:
int[] foo(int a, int b) { . . . } // Correct int foo[](int a, int b) { . . . } // Incorrect int foo(int a, int b)[] { . . . } // Incorrect
Parameters. Parameters are considered local variables of the method (that is, they are in the method's namespace). They may not be overridden by other local variables, and it is a compile time error to attempt to do so.
Parameters are passed by value. Thus, integer arguments are copied to their corresponding parameters and that reference arguments are copied to their corresponding parameters.
this. The keyword this may be used within a method to reference the current object. this is useful for referencing shadowed variables or methods (that is, variables or methods in the current object that are masked by identically named local variables or parameters in the method) or for passing the current object as an argument to another object's method. For example:
class Example { int key; int foo; void setKey(int key, AnotherClass obj) { this.key = key; // key is a local variable // (actually a parameter) that // masks an instance variable // so this must be used foo = key; // foo == this.foo but key // is the parameter key obj.notify(this); // Pass the current object to // the method of another // object } }
As illustrated above, the expression "this." is appended implicitly to any non-shadowed instance variable or method of the current object. If a local variable has the same name as an instance variable or method, then the programmer must explicitly preface the instance variable or method name with "this." in order to access the shadowed instance variable or method.
Body of Methods. The body of a method consists of a sequence of local variable declarations followed by statements:
11. |
METHOD_BODY |
--> |
{ LOCAL_VAR_DECLARATION* SIMPLE_STATEMENT+ } |
Unlike Java, Decaf requires that all local variable declarations precede any statement. Java allows declarations and statements to be intermixed. The syntax for local declarations is identical to that for instance variables: one variable per declararion and no initialization.
Unlike Java, Decaf does automatically assign zero (0) to local integers does automatically assign null to local reference variables. This permits a run time check for uninitialized references. (Java checks for uninitialized variables at compile time.)
There are only two namespaces in Decaf---the method namespace and the class namespace. When a variable is declared in a method body it is placed in the method's namespace.
A local variable's declaration is in force throughout the body of the method in which it is declared. If an identifier previously declared as an instance variable or method in the enclosing class, then the other declaration is hidden for the remainder of the method body. this must be used to access these hidden instance variables and methods.
Type scope. Unlike Java, an identifier with the same name as a previously defined class does not hide the class name. Hence, the previously defined class name can still be used as a type name. In other words, a type name always remains in force.
Variable Name Resolution Summary. The compiler should resolve a variable reference by looking first in the method namespace. If the variable cannot be found in the method namespace, the compiler should search the enclosing class's namespace. If the variable is found in neither the method namespace nor the enclosing class's namespace, then the variable is considered undeclared and a compile time error should be generated.
Decaf statements do not have values; they are evaluated only for their effect. In Decaf (unlike Java) this restriction extends to assignment. For example, in Java one can write expressions such as a = b = c or (a = b)[3]. Such statements are illegal in Decaf since an assignment statement has no value.
Statements may be one of the following:
14. |
SIMPLE_STATEMENT |
--> |
; |
| |
NAME = EXPRESSION ; |
||
| |
NAME ( ARG_LIST ) ; |
||
| |
print ( ARG_LIST ) ; |
||
| |
CONDITIONAL_STATEMENT |
||
| |
while ( EXPRESSION ) STATEMENT |
||
| |
return OPTIONAL_EXPRESSION ; |
Empty Statement. The semicolon by itself represents an empty statement. It does nothing.
Assignment. An assignment statement evaluates an expression and assigns the result to a variable:
NAME = EXPRESSION ;
A NAME is either the reserved keyword this, an identifer, a field access via the dot notation, or an array access:
15. |
NAME |
--> |
VARIABLE_ACCESS |
| |
VARIABLE_ACCESS.FIELD_ACCESS |
||
16. |
VARIABLE_ACCESS |
--> |
this |
| |
identifier |
||
| |
identifier [EXPRESSION] |
||
17. |
FIELD_ACCESS |
--> |
identifier |
| |
identifier [EXPRESSION] |
Note that array accesses may occur in either half of a dotted name. For example, both a[3].b and c.d[2] are legal names.
Method Calls. A method call consists of the name of the method name and a list of zero or more comma-separated arguments. The name may be a dotted name; for example, obj.f(3, x) A Decaf program may ignore the return value of a method which is not declared void. For example, if foo returns an int, then both of the following statements are valid:
stack[3].foo(5, 10); x = stack[3].foo(5, 10);
Output. The print operator writes zero or more integers to the screen. Each displayed value is followed by a blank space. The arguments to print must all be integer-valued expressions. A print operator with no arguments prints a new line; thus, a print operator with no arguments may be used to obtain a blank line in the output.
For example, the code fragment
print(34, 4, 2, 15); print(); print(34, 4, 2, 15); print(15); print(16); print(); print(-4); print(); print(); print(-4);
would display
34 4 2 15 34 4 2 15 15 16 -4 -4
There is no way to display a string of characters.
Decaf's technique of handling input and output is different from Java's. Decaf builds the input and output commands into the language; Java uses library commands for I/O. Java's libraries are more flexible since developers can write new I/O routines to match the needs of their applications or handle new I/O devices. Since Decaf was devised to teach the concepts of compiler construction and is not meant to be a production language, the limitations of Decaf's I/O is acceptable.
Conditional Statements. For simplicity, Decaf does not support the switch construct of Java. It does support the if/else statement:
19. |
CONDITIONAL_STATEMENT |
--> |
if ( EXPRESSION ) STATEMENT |
| |
if ( EXPRESSION ) STATEMENT else STATEMENT |
In Decaf, the conditonal expression must evaluate to an int. As in C/C++, any integer except 0 is considered to be true, and 0 is considered false. (Java requires that conditional expression be type boolean, but Decaf does not support a boolean type.)
The dangling else problem is resolved by matching an else with the closest previous unmatched if in the same method body. (Java resolves the dangling else the same way.)
Iteration. Java supports multiple loop forms, including while, do, and for. For simplicity, Decaf supports only the while construct:
while ( EXPRESSION ) STATEMENT
The expression in the while statement must evaluate to an integer. The statement is executed until the expression evaluates to 0 (that is, false).
Java supports break and continue that alter the normal program flow in a loop. These constructs are convenient but are not absolutely necessary, and thus they have been omitted from Decaf to make it more managable.
Return Statement. Values are returned from methods via the return statement.
return OPTIONAL_EXPRESSION ;
The type of the return expression must be identical to the return type of the method in which it is used. It is a compile time error if the types do not agree or for the return expression to have an argument in a void method.
Compound Statements. A compound statement can be used to group a sequence of statements to be executed as the body of a while of if statement. A compound statement is a sequence of statements enclosed in curly braces. Unlike Java, Decaf does not support "inner blocks." In Java, a pair of curly braces within a method body defines an inner block in which variables can be declared which are local to this inner block. Since Decaf supports only two namespaces, method and class, inner blocks for finer control of variable scoping are not supported.
Decaf expressions are simpler and therefore less powerful than Java expressions. For example, Java allows object references to be determined by complex expressions, but Decaf requires that object references be determined by variable names. The following expressions are legal in Java but illegal in Decaf:
c = (a = b)[3]; // c is assigned the contents of b[3] and // a is assigned a reference to the array // pointed to by b c = foo.goo(a,b).x; // the object containing field x is determined // at runtime by executing the method goo // in the object referenced by object foo
One would use multiple Decaf statements to achieve the effects of each of these single Java statements.
Decaf expressions involve the following:
Input
read reads and returns and 32-bit signed integer from the keyboard. Read is an operator like print and new; it is neither a function nor a method. It looks and behaves as if it were a function, however. The trailing parentheses emphasize the fact that read is not a simple variable.
Arithmetic and Relational Operators
The arithmetic and relational operators are straightforward. Integer division (/) truncates. For example, 8/3 = 2. Your compiler must insert code that detects division by zero. If your code detects an attempt to divide by zero, the program should terminate and report the execptional condition.
Relational operators should return integer 1 if the expression is true and the integer 0 if the expression is false.
Conditional and, Conditional or, and not
The conditonal and operator (&&) should evaluate only its first operand if the first operand is false (i.e., the integer 0). Similarly, the conditional or (||) operator should evaluate only its first operand if the first operand evaluates to true (i.e., any integer except 0). Both conditional and and conditional or operators should return the integer 1 if the expression is true and the integer 0 if the expression is false.
The not operator (!) should return the integer 0 if the expression is true and the integer 1 if the expression is false.
The new Operator
The new operator creates a new object, initializes all its instance variables to zero or null (as appropriate), and returns a reference to the newly created object.
Array Allocation. An allocation request for an array object must specify the size of the array. The new operator allocates enough space to accommodate the requested number of objects. For example, the allocation request
a = new int[10];
would be handled by allocating space for 10 integer components (40 bytes).
Note that an array variable is not allocated storage when it it declared. The dimension is not specified when the array is declared. An array declaration only allocates space for a reference to the array (4 bytes). Consequently, the only way to allocate space for an array object is to use the new operator.
Order of Evaluation
All operands in an expression are evaluated left-to-right order, regardless of the precedence of operators. For example, suppose a[3] is originally 5 and that the method obj.foo() returns 20 and sets a[3] to 10. Then the result of the expression
obj.foo() + (a[3] * a[3])
is 20 + (10 * 10) or 120 (notice that even though the parenthesized expression has greater precedence, the operands are evaluated in left-to-right order). Similarly, all parameters are evaluated in strictly left-to-right order. Specific rules such as these are required because because side effects can cause different results depending on the order in which the operands are evaluated. For example, if the operands in the expression obj.foo() + (a[3] * a[3]) were evaluated in right-to-left order, the result would be 20 + (5 * 5) or 45. Hence, to ensure that Decaf programs generated by different compilers produce the same results, an order of evaluation must be imposed.
The Java language specification imposes a strict order of evaluation, unlike C and C++ in which the order of evaluation is unspecified. To see why this is important, this sequence of C++ code:
Stack s = new Stack(); s.push(3); s.push(5); cout << s.pop() << " " << s.pop() << endl;
will print "3 5" on some compiler/platform combinations and "5 3" on others.
null References
It is a runtime error to attempt to access the field of a null reference. For example, if the statement
obj.x = 2;
were executed when obj is null, the program should be terminated and an error message should be displayed.
Decaf arrays are one-dimensional arrays. As in Java, arrays are first-class objects with a single publicly accessible field called length. If the array's length is n, then its components range from 0 to n - 1 inclusive. The array subscripts must be integers (i.e., int types). A runtime error must be generated and the program terminated if the subscript is out of range. Java behaves similarly by throwing an exception.
An array's length field may be examined to determine the number of elements in the array. This length field is read only; once new is used to allocate the array, its length never changes. The array itself can always be reassigned, of course: it can be assigned to another, already allocated, array, and it can be assigned to the result of invoking the new operator again.
Arrays of zero length can be allocated. In this case, the length field of the array object would be zero, and the array itself would be empty. The array object has a secret instance variable that points to the array components. The only way the programmer can access these components is via square brackets. For example, the fourth element of array foo is accessed via the expression foo[3]. The length field is accessed via the standard dot notation: foo.length.
A Decaf program must have exactly one class with a method named main(). main() should take no arguments, and its return type is void.