Lexical Analysis

Assignment 2
CPTR 415 Compiler Construction


  1. Overview
  2. In this assignment you will implement a lexical analyzer for the Decaf language.

  3. Requirements
    1. Tool. You are to use Lex to implement your lexical analyzer.

    2. Comments. Recall that anything between the // symbol and the end of the line (newline) is considered a comment. Your lexical analyzer should not attempt to tokenize any lexemes between the // symbol and the newline. Instead it should discard the // symbol and the remaining lexemes on that line (that is, no tokens are passed to the parser).

    3. Unary minus vs. binary minus. Defer differentiating unary minus from binary minus. It is possible to do so at this stage, but it is much simpler to just recognize a MINUS token and let the parser determine which it is.

    4. Errors. Your lexical analyzer should be able generate three types of error messages:

      • Single character errors. If the lexical analyzer cannot recognize a character and the character is followed by whitespace or an operator symbol, an error message should be printed indicating that the character could not be recognized.
      • Word error. If the lexical analyzer cannot recognize a character and the character is followed by letters, digits, or underscores, the lexical analyzer should
        1. scan to the first whitespace character or operator symbol,
        2. treat the string from the unrecognized to the whitespace or operator symbol as a word, and
        3. print out an error message indicating that the word could not be recognized. For example, the string hello?there would be split into an identifier token for hello and a word error message for ?there.

      • Too many errors. If the lexical analyzer finds more than 20 errors, it should issue an error message indicating that it is quitting due to too many errors.

  4. Format of Output
  5. Your lexical analyzer should output

    1. each token it recognizes,
    2. the value of each token (if appropriate), and
    3. the line and column number in which the token appears.

    The output should be formatted as follows:

    	line   column   token     value
    
    	xxxx   xxxx     xxxx      xxxx
    	

    If a token has no value (for example, a reserved word will have no value), then the value column will be empty Comments and whitespace should not be printed out as tokens.

    If an error is detected, then the token should be the character or word that is recognized, and the value should be the error message. The quit error message should be printed starting in the first column. For example, given the following input:

    	class foo {
    	   int x;
    	   foo village?market;
    	   ? == 5^;
    	

    your lexical analyzer might produce:

    	line   column   token             value
    
    	1      1        CLASS          
    	1      7        IDENTIFIER        foo
    	1      11       LCURLY
    	2      4        INT      
    	2      8        IDENTIFIER        x
    	2      9        SEMICOLON
    	3      4        IDENTIFIER        foo
    	3      8        IDENTIFIER        village
    	3      14       ?market           error: unrecognized word
    	3      21       SEMICOLON
    	4      4        ?                 error: unrecognized character
    	4      6        EQ               
    	4      9        NUMBER            5
    	4      10       ^                 error: unrecognized character
    	4      11       SEMICOLON
    
    	

    Your output will be graded manually, so things such as spacing between columns and the names of your tokens are up to you.

  6. Assignment Submission and Grading
  7. You should do all your developement within a single directory. When you are ready to submit the assignment, create a directory named lexer_submission in your home directory. copy all your files from your developement directory into this submission directory. Be sure to include a make file and check to ensure that builds your project successfully. To check your code I should need only run make to build your executable and execute it.

    Your lexical analyzer will be tested against a number of test files.

    You source code should use comment blocks to explain any non-obvious operations (for example, how you handle comments).

    The possible number of points for this assignment is 10.