Assignment #1 - Onyx Lexical Analysis
Due: 8:59:59pm Friday, January 26th, 2007
Description | Tasks | Advice | Deliverables | Resources
| Date | Description |
| 01-19-07 | Comment concerning extraneous symbols in onyx_sym.java. [SBB] |
| 01-4-07 | Release of the document. |
The first class assignment is to write a scanner and scanner driver that recognizes the lexical tokens for Onyx our subset of the XQuery language and outputs the lexical tokens as an OXML document. For the purposes of this assignment, the Onyx Lexical Specification should be considered the authoritative document.
The scanner for this assignment should recognize all of the lexical tokens for the Onyx language, and must follow the Onyx lexical specification provided. Each token has an associated symbol value (and integer id) defined in the file provided to you, onyx_sym.java, and your scanner must make correct use of the integer IDs defined in the file. The correspondence between each lexical element and its symbol value is identified in the Onyx Lexical Specification.
You will notice that there are extra symbols in onyx_sym.java which are not listed in the Lexical Specification. These do not correspond to any symbols recogized by our specification and you may ignore them (For those who want to know more: Onyx can be retooled to define a different language, and we have included additional symbols that might be reserved in another language). [1/19/07]
In the following, "scan" means to match in the input text; "tokenize" means to output an appropriate OXML token element from the scanner driver. This will typically involve creating and returning from the scanner's next_token() method a java_cup.runtime.Symbol object with the appropriate int id and java.lang.String value.
For each tokenized token, the scanner driver must output an OXML token element with an id attribute whose value is the integer ID of the token. If the token has a value, it appears as text content of the element. If the token has null value, the element is empty. Line and column counting starts from 1. More information about the output format is given in the next section.
You will turn in your scanner project as a UNIX tar archive. When untarred into an empty working directory, a Makefile must appear at the top level. Other files and directory structure may also appear, as appropriate, but must contain no ".class", ".o", ".jar", or executable files.
The Makefile must have a target named "scan". The command
make scan
issued in the top level working directory should build (i.e., compile) your scanner and scanner driver, resulting in an executable file named scan in the working directory.
The scan program will be run with the command line of the form
./scan filename
issued in that working directory. It should scan the input file named on the command line, and for each tokenized lexeme found in the file, scan must produce a OXML element representing the token, and insert it into an in-memory OXML document. When scanning completes, your program should output the required OXML document using a file with the same name as the input file, but with an added ".xml" extensions. The onyx_xml library, which was discussed in section, may be used to create conforming OXML documents in memory, to output them to a file, and also to read a conforming OXML document into memory. From ACS machines, it is pre-installed in the CSE 131A Tools directory: ~cs131w/../public/Tools/onyx_xml.jar. Documentation is found here.
The top-level element of the OXML document will be a single OnyxSource element with a required attribute filename with value equal to the input filename. The OnyxSource element will contain as many token elements as there are tokenized tokens in the input file, including the EOF token that terminated the scan. So for example, if the contents of file test.onyx arelet $a := 3 + 4 return $a
then the result of running scan test.onyx should be to produce an XML file test.xqy.xml containing the following
<?xml version="1.0" encoding="UTF-8"?> <OnyxSource filename="test.onyx"> <token column="1" id="57" line="1"/> <token column="5" id="10" line="1">$a</token> <token column="8" id="50" line="1"/> <token column="11" id="43" line="1">3</token> <token column="13" id="31" line="1"/> <token column="15" id="43" line="1">4</token> <token column="1" id="48" line="2"/> <token column="8" id="10" line="2">$a</token> <token column="-1" id="0" line="-1"/> </OnyxSource>
Automated tools will check the output for correctness with respect to the required file format and the lexical specification.
(If you are interested, an XML Schema specification of the required form of the output file can be found here.)
The onyx_sym.java file was generated from the Onyx Grammar using CUP. Since you won't write the grammar until the next assignment, we've generated a onyx_sym.java file for you. It will be advantageous for later assignments if the scanner is designed to integrate with a CUP generated parser without any major modifications. A scanner driver with particular functionality is in any case required for this assignment, as described below.
The CUP scanner integration standard expects the scanner to implement the following Java interface:
package java_cup.runtime;
public interface Scanner {
public Symbol next_token() throws java.io.IOException;
}
The Symbol class is defines these instance variables (as well as several constructors):
package java_cup.runtime;
public class Symbol {
public int sym;
public int left, right;
public Object value;
}
The left, right fields can be used to hold the line and column location of the scanned token; the value field holds the token's value, or null if none. (You can reuse the right field to store the line number, avoiding the need to define a subclass)
Access to the java_cup.runtime package for both
the Scanner interface definition and the Symbol class definition
is easily accomplished if the CUP installation point is included on
your
CLASSSPATH, as in the CSE 131A environment.
A scanner developed using the JFlex tool can easily integrate with a CUP parser. When configured correctly, JFlex will write your scanner class source code file for you. You can then write a scanner driver that creates an instance of your scanner class with the appropriate input stream, and repeatedly calls its next_token() method, encoding them as OXML. See the JFlex documentation. There is also a sketch of the architecture of a JFlex-centric scanner written by Tim Pevzner (a former CSE 131A TA), which is available here. In your next assignment, a CUP-generated parser will take over the role of this scanner driver.
If you write this project using Java, say with your scanner driver implemented as a main method in classScannerDriver
in a package project, the scan target in the
required
Makefile can be given as
scan:
javac project/ScannerDriver.java
echo '#!/bin/sh' > scan
echo 'java project.ScannerDriver $$1' >> scan
chmod ugo+x scan
Then make scan will compile your ScannerDriver class and create an executable shell script named scan that will invoke the java interpreter on it, passing it the first command-line argument. Makefiles are a good way to manage dependencies among parts of your project. In particular, it can be useful to have a clean target, defined so that make clean will remove all compiled files so that they can all be recompiled anew, and a tests target, so that make tests will check your program against a suite of tests that you create. Lastly, it would be convenient for you if your Makefile can build the scanner directly from the .flex specification, i.e. by running jflex and then compiling the resulting Java file. For more information, GNU has a tutorial on make.
In addition to your assignment, you must include three other files:
Refer to the A1 Phases document for additional instructions about these files.Assignments that do not include a properly completed DECLARATION, MEMBERS, and TeamEval files will not receive a grade and will be treated as late assignments.
From the directory containing your Makefile, MEMBERS, DECLARATION, and TeamEval files and the root of your A1 development directory, run the command TURNIN.A1 and follow the dialog. You can run this command as many times as you want up until the assignment deadline; any earlier turnin of yours will just be overwritten. In fact it is recommended that you run the command early, as soon as you have a partial solution to the assignment, to make sure the turnin mechanism is working for you.
DON'T WAIT UNTIL THE LAST MINUTE TO RUN TURNIN FOR THE FIRST TIME!!!
Often problems are uncovered when running TURNIN. You'll need to resolve these problems before your submission can be accepted. Waiting until the last minute to run TURNIN for the first time per assignment is done at your own risk!