CSE 131A - Scanner Assignment

Assignment #1 - Onyx Lexical Analysis
Due: 8:59:59pm Friday, January 26th, 2007

Description | Tasks | Advice | Deliverables | Resources

Change Log

Date  Description 
01-19-07 Comment concerning extraneous symbols in onyx_sym.java. [SBB]
01-4-07 Release of the document.

Phases

See the A1 Phases document for guidance.

Description

The first class assignment is to write a scanner and scanner driver that recognizes the lexical tokens for Onyx our subset of the XQuery language and outputs the lexical tokens as an OXML document. For the purposes of this assignment, the Onyx Lexical Specification should be considered the authoritative document.

The scanner for this assignment should recognize all of the lexical tokens for the Onyx language, and must follow the Onyx lexical specification provided. Each token has an associated symbol value (and integer id) defined in the file provided to you, onyx_sym.java, and your scanner must make correct use of the integer IDs defined in the file. The correspondence between each lexical element and its symbol value is identified in the Onyx Lexical Specification.

You will notice that there are extra symbols in onyx_sym.java which are not listed in the Lexical Specification. These do not correspond to any symbols recogized by our specification and you may ignore them (For those who want to know more: Onyx can be retooled to define a different language, and we have included additional symbols that might be reserved in another language). [1/19/07]

 

Tasks

In the following, "scan" means to match in the input text; "tokenize" means to output an appropriate OXML token element from the scanner driver. This will typically involve creating and returning from the scanner's next_token() method a java_cup.runtime.Symbol object with the appropriate int id and java.lang.String value.

For each tokenized token, the scanner driver must output an OXML token element with an id attribute whose value is the integer ID of the token. If the token has a value, it appears as text content of the element. If the token has null value, the element is empty. Line and column counting starts from 1. More information about the output format is given in the next section.

Deliverables

You will turn in your scanner project as a UNIX tar archive. When untarred into an empty working directory, a Makefile must appear at the top level. Other files and directory structure may also appear, as appropriate, but must contain no ".class", ".o", ".jar", or executable files.

The Makefile must have a target named "scan". The command

make scan

issued in the top level working directory should build (i.e., compile) your scanner and scanner driver, resulting in an executable file named scan in the working directory.

The scan program will be run with the command line of the form

   ./scan  filename

issued in that working directory. It should scan the input file named on the command line, and for each tokenized lexeme found in the file, scan must produce a OXML element representing the token, and insert it into an in-memory OXML document. When scanning completes, your program should output the required OXML document using a file with the same name as the input file, but with an added ".xml" extensions. The onyx_xml library, which was discussed in section, may be used to create conforming OXML documents in memory, to output them to a file, and also to read a conforming OXML document into memory. From ACS machines, it is pre-installed in the CSE 131A Tools directory: ~cs131w/../public/Tools/onyx_xml.jar. Documentation is found here.

The top-level element of the OXML document will be a single OnyxSource element with a required attribute filename with value equal to the input filename. The OnyxSource element will contain as many token elements as there are tokenized tokens in the input file, including the EOF token that terminated the scan. So for example, if the contents of file test.onyx are

let $a := 3 + 4
return $a

then the result of running scan test.onyx should be to produce an XML file test.xqy.xml containing the following

<?xml version="1.0" encoding="UTF-8"?>
<OnyxSource filename="test.onyx">
   <token column="1" id="57" line="1"/>
   <token column="5" id="10" line="1">$a</token>
   <token column="8" id="50" line="1"/>
   <token column="11" id="43" line="1">3</token>
   <token column="13" id="31" line="1"/>
   <token column="15" id="43" line="1">4</token>
   <token column="1" id="48" line="2"/>
   <token column="8" id="10" line="2">$a</token>
   <token column="-1" id="0" line="-1"/>
</OnyxSource>

Automated tools will check the output for correctness with respect to the required file format and the lexical specification.

(If you are interested, an XML Schema specification of the required form of the output file
can be found here.) 

Some test files and corresponding correct output files can be found under ~cs131w/../public/Materials/A1/Tests.

 

Advice

The onyx_sym.java file was generated from the Onyx Grammar using CUP. Since you won't write the grammar until the next assignment, we've generated a onyx_sym.java file for you. It will be advantageous for later assignments if the scanner is designed to integrate with a CUP generated parser without any major modifications. A scanner driver with particular functionality is in any case required for this assignment, as described below.

The CUP scanner integration standard expects the scanner to implement the following Java interface:

package java_cup.runtime;
public interface Scanner {
   public Symbol next_token() throws java.io.IOException;
}

The Symbol class is defines these instance variables (as well as several constructors):

      package java_cup.runtime;
      public class Symbol {
         public int sym;
         public int left, right;
         public Object value;
      }

The left, right fields can be used to hold the line and column location of the scanned token; the value field holds the token's value, or null if none. (You can reuse the right field to store the line number, avoiding the need to define a subclass)

Access to the java_cup.runtime package for both the Scanner interface definition and the Symbol class definition is easily accomplished if the CUP installation point is included on your CLASSSPATH, as in the CSE 131A environment.

A scanner developed using the JFlex tool can easily integrate with a CUP parser. When configured correctly, JFlex will write your scanner class source code file for you. You can then write a scanner driver that creates an instance of your  scanner class with the appropriate input stream, and repeatedly calls its next_token() method, encoding them as OXML.  See the JFlex documentation. There is also a sketch of the architecture of a JFlex-centric scanner written by Tim Pevzner (a former CSE 131A TA), which is available here. In your next assignment, a CUP-generated parser will take over the role of this scanner driver.

If you write this project using Java, say with your scanner driver implemented as a main method in class ScannerDriver in a package project, the scan target in the required Makefile can be given as

scan:
javac project/ScannerDriver.java
echo '#!/bin/sh' > scan
echo 'java project.ScannerDriver $$1' >> scan
chmod ugo+x scan

Then make scan will compile your ScannerDriver class and create an executable shell script named scan that will invoke the java interpreter on it, passing it the first command-line argument. Makefiles are a good way to manage dependencies among parts of your project.  In particular, it can be useful to have a clean target, defined so that make clean will remove all compiled files so that they can all be recompiled anew, and a tests target, so that make tests will check your program against a suite of tests that you create.  Lastly, it would be convenient for you if your Makefile can build the scanner directly from the .flex specification, i.e. by running jflex and then compiling the resulting Java file.   For more information, GNU has a tutorial on make.

 

Turnin

In addition to your assignment, you must include three other files:

  • a file named MEMBERS which contains the names and login names of the members of your programming team (even if there is only one of you).
  • a file named DECLARATION stating that the work was your own, and that you did not violate the Academic Integrity policies in effect for the course.
  • a file called TeamEval describing your experiences.
  • Refer to the A1 Phases document for additional instructions about these files.

    Assignments that do not include a properly completed DECLARATION, MEMBERS, and TeamEval files will not receive a grade and will be treated as late assignments.

    From the directory containing your Makefile, MEMBERS, DECLARATION, and TeamEval files and the root of your A1 development directory, run the command TURNIN.A1 and follow the dialog. You can run this command as many times as you want up until the assignment deadline; any earlier turnin of yours will just be overwritten. In fact it is recommended that you run the command early, as soon as you have a partial solution to the assignment, to make sure the turnin mechanism is working for you.

    DON'T WAIT UNTIL THE LAST MINUTE TO RUN TURNIN FOR THE FIRST TIME!!!

    Often problems are uncovered when running TURNIN. You'll need to resolve these problems before your submission can be accepted. Waiting until the last minute to run TURNIN for the first time per assignment is done at your own risk!


    Resources

    onyx_sym.java
    The onyx_sym.java file defines a set of static final int values for the tokens of Onyx language. This file is automatically generated by CUP. This file provides a consistent set of token id values for everyone working on this assignment. From ACS machines, onyx_sym.java is available in the CSE 131A "~cs131w/../public/Materials/A1" directory.

    onyx_xml library
    The onyx_xml library may be used to create OXML documents in memory, to output them to a file, and also to read a compliant OXML document into memory. From ACS machines, it is pre-installed in the CSE 131A Tools directory: ~cs131w/public/Tools/onyx_xml.jar. Documentation is found HERE.

    JFlex
    The JFlex tool generates high-quality scanners from an input grammar of regular expressions. The resulting scanners integrate very well with CUP-generated parsers. From ACS machines, it is pre-installed in the CSE 131A ~cs131w/public/Tools/jflex directory.   JFlex is free software, released under the GPL, and available for free download directly from the vendor at www.jflex.de

    CUP
    For this assignment, the CUP tool is need only for its definition of the Scanner interface and Symbol class. From ACS machines, it is pre-installed as a jar file in the CSE 131A Tools directory: ~cs131w/public/Tools/java-cup-v11a.jar.  CUP is free software, released under the GPL, and available for free download directly from the support site located at http://www2.cs.tum.edu/projects/cup/.

    Test Cases
    We have published test cases with corresponding output. The actual test case files and the results reside in ~/../public/Materials/A1/Tests.
    (The filename for each test result adds a .xml suffix to the filename for its corresponding test case.)