Lexical Specification

Change Log

Date Description
1/17/07 Slight correction to the rule for comments [SBB] 

Description

The scanner for Assignment 1 should recognize the following lexical tokens for the Onyx language. Each token has an associated integer ID as defined in the file onyx_sym.java. The name of a final variable defined in that file may differ somewhat from the corresponding name below, but the equivalence should be sufficiently clear. Also note that the regular expression syntax used here may differ from that of any particular scanner generator tool.

Tokens are broadly grouped into the following categories:

Keywords | Punctuation and Operator Symbols | QNames, Variables, and Literals
Comments | Whitespace | Lexical Errors and EOF

Keywords

The scanner should appropriately recognize the following Onyx keywords:

declare      function     variable     as
if           then         else         for
let          where        return       in
to           except       before       after
ordered-by   descending   ascending    intersect
union        le           lt           ge
gt           eq           ne           and
or           div          idiv         mod

Onyx is case-sensitive, and there is only one way to spell each of the keywords, so there is only one possible lexeme for each keyword token. Therefore, these keyword tokens are completely determined by their integer ID, and are considered to have null value.  The location of a keyword token is the location of the first character in the matched lexeme.
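As a minimal sketch of the case-sensitivity rule, the keyword lexemes can be held in a set and looked up by exact match. The class name OnyxKeywords and the method isKeyword are hypothetical, not part of the assignment files, and the sketch assumes Java 9 or later for Set.of. How a keyword lexeme is prioritized over the QName pattern, which it also matches, is left to the scanner generator and is not shown here.

```java
import java.util.Set;

final class OnyxKeywords {
    // The keyword lexemes listed above, spelled exactly as in the specification.
    static final Set<String> KEYWORDS = Set.of(
        "declare", "function", "variable", "as",
        "if", "then", "else", "for",
        "let", "where", "return", "in",
        "to", "except", "before", "after",
        "ordered-by", "descending", "ascending", "intersect",
        "union", "le", "lt", "ge",
        "gt", "eq", "ne", "and",
        "or", "div", "idiv", "mod");

    // Exact, case-sensitive lookup: "if" is a keyword, "IF" and "If" are not.
    static boolean isKeyword(String lexeme) {
        return KEYWORDS.contains(lexeme);
    }
}
```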

Punctuation and operator symbols

The scanner should recognize the following Onyx punctuation symbols and symbolic operators:

/ Slash   // Slash Slash
= Equals   != Not Equals
<= Less Than or Equals   >= Greater Than or Equals
< Less Than   > Greater Than
:= Colon Equals   * Mult
- Minus   + Plus
| Vertical Bar   , Comma
( Left Parenthesis   ) Right Parenthesis
[ Left Square Bracket   ] Right Square Bracket
{ Left Curly Brace   } Right Curly Brace
; Semicolon

Punctuation and symbolic operator tokens are completely determined by their integer ID; they have no independent value.  Their location is the location of the first character in the matching lexeme.
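Several of these symbols are prefixes of others (for example < and <=, or / and //). The following sketch assumes the usual longest-match rule of scanner generators, which this specification does not state explicitly. The names OnyxSymbols and matchSymbol are hypothetical, and the interaction between { and the comment opener {-- is not handled here.

```java
final class OnyxSymbols {
    // Two-character symbols are listed first so they win over their one-character prefixes.
    static final String[] SYMBOLS = {
        "//", "!=", "<=", ">=", ":=",
        "/", "=", "<", ">", "*", "-", "+", "|", ",",
        "(", ")", "[", "]", "{", "}", ";"
    };

    // Returns the longest symbol that starts at offset pos, or null if none does.
    static String matchSymbol(String input, int pos) {
        for (String symbol : SYMBOLS) {
            if (input.startsWith(symbol, pos)) {
                return symbol;
            }
        }
        return null;
    }
}
```

Under this assumption, the input <= 1 yields a single Less Than or Equals token, while < = yields Less Than followed by Equals.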

QNames, Variables, and Literals

The scanner should recognize the following tokens. Unlike keywords and punctuation, an instance of these tokens will have a nontrivial string value based on the specific matching lexeme in addition to their token id.

QName

QName ::= ( Letter | "_" ) ( Letter | Digit | "." | "-" | "_" )*

A QName in Onyx is an identifier name. It differs from an XML qualified name in that namespace specifiers are not supported.

The value of a QName token is the matched lexeme.  The  location for a QName token is the line and column of the first character in the matched lexeme.

Variable name

VARNAME ::= "$" QName
In Onyx, variable names are just QNames preceded by a dollar sign ($) with no intervening whitespace.

The value of a VariableName token is the matched lexeme.  The  location for a VariableName token is the line and column of the first character in the matched lexeme, i.e. the $ character.
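The two rules above can be sketched with java.util.regex. This is only an illustration: Letter and Digit are not defined in this document, so the sketch assumes Letter is [A-Za-z] and Digit is [0-9]. The class name OnyxNames and its methods are hypothetical.

```java
import java.util.regex.Pattern;

final class OnyxNames {
    // Assumption: Letter = [A-Za-z], Digit = [0-9]. The hyphen is last in the class, so it is literal.
    static final String QNAME = "[A-Za-z_][A-Za-z0-9._-]*";

    static final Pattern QNAME_PATTERN = Pattern.compile(QNAME);
    // A variable name is a dollar sign immediately followed by a QName.
    static final Pattern VARNAME_PATTERN = Pattern.compile("\\$" + QNAME);

    static boolean isQName(String lexeme) {
        return QNAME_PATTERN.matcher(lexeme).matches();
    }

    static boolean isVarName(String lexeme) {
        return VARNAME_PATTERN.matcher(lexeme).matches();
    }
}
```

For example, _a.b-c is accepted as a QName, 9a is not (it starts with a digit), and $ x is not a variable name because of the intervening space.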

Boolean literal

BOOLEANLITERAL ::= "true" | "false"

Integer literal

INTEGERLITERAL ::= (Digit)+

Decimal literal

DECIMALLITERAL ::= ( "." (Digit)+ ) | ( (Digit)+ "." (Digit)* )

Onyx recognizes only one type of numeric literal: Integer. However, we include the specification for Decimal since Decimal numbers may appear in XML.

The lexical value of a numeric literal token is the matched lexeme. The line and column location for a numeric literal token is the line and column of the first character in the matched lexeme.
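The numeric rules have a few edge cases worth checking: a trailing dot with no fraction digits is a Decimal, while a lone dot is neither. A minimal sketch, assuming Digit is [0-9]; the class name OnyxNumbers and the method classify are hypothetical.

```java
import java.util.regex.Pattern;

final class OnyxNumbers {
    // Assumption: Digit = [0-9].
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    // Either "." Digit+ or Digit+ "." Digit*, as in the DECIMALLITERAL rule.
    static final Pattern DECIMAL = Pattern.compile("\\.[0-9]+|[0-9]+\\.[0-9]*");

    // Returns the token name for a complete lexeme, or null if it is not a numeric literal.
    static String classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) {
            return "INTEGERLITERAL";
        }
        if (DECIMAL.matcher(lexeme).matches()) {
            return "DECIMALLITERAL";
        }
        return null;
    }
}
```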
String literal

STRINGLITERAL ::= ( " ( "" | [^"] )* " )

Onyx strings are delimited with quotation marks ("). A quotation mark can be embedded in a string by doubling it. Inside a string, a " character is recognized as a literal quote only when it appears as the two-character sequence "" within a terminated string.

Predefined entity references and Unicode character references can appear within a string, and when constructing the value of the string, must be interpreted as referring to the appropriate character.

The five predefined entity references are:

ENTITY REFERENCE   REFERS TO
"&lt;"             <
"&gt;"             >
"&amp;"            &
"&quot;"           "
"&apos;"           '

Patterns for Unicode character references are (HexDigit is the character class [0-9a-fA-F] ): 

CHARACTER REFERENCE PATTERN   REFERS TO
"&#" Digit+ ";"               The Unicode character with the given decimal codepoint value, if it exists; else undefined
"&#x" HexDigit+ ";"           The Unicode character with the given hexadecimal codepoint value, if it exists; else undefined

The value for a string literal token is the sequence of characters that is intended to be represented by the string literal. That is, the value omits the enclosing delimiters, replaces each doubled quotation mark with a single one, and replaces each predefined entity reference or character reference with the character it refers to.

The line and column location for a string literal token is the location of the quote that begins the literal.  For example, the string literal

"How to embed "": use double occurrence"

has value

How to embed ": use double occurrence

and its location is the location of the quotation mark that begins the literal (and which doesn't appear in the value).

For another example, the string literal

"He &amp; she said ""arr&#x00EA;t!"""
has value
He & she said "arrêt!"

and its location is the location of the quote that begins the literal (and which doesn't appear in the value).  (Most browsers will display that line correctly; the character reference &#x00EA; refers to the Unicode character e with circumflex, ê.)
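The value computation described above can be sketched as a single left-to-right pass over the lexeme, so that the output of one replacement is never rescanned (for instance, &amp;lt; yields the four characters &lt; rather than <). The class name OnyxStrings and the method valueOf are hypothetical. Since the specification leaves nonexistent codepoints undefined, this sketch makes the assumption of copying such references through unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnyxStrings {
    // Doubled quote, predefined entity, decimal reference, or hexadecimal reference.
    private static final Pattern ESCAPE = Pattern.compile(
        "\"\"|&(lt|gt|amp|quot|apos);|&#([0-9]+);|&#x([0-9a-fA-F]+);");

    // Computes the token value from a complete, terminated string literal lexeme.
    static String valueOf(String lexeme) {
        String body = lexeme.substring(1, lexeme.length() - 1); // drop the delimiters
        StringBuilder out = new StringBuilder();
        Matcher m = ESCAPE.matcher(body);
        int last = 0;
        while (m.find()) {
            out.append(body, last, m.start()); // ordinary characters are copied as-is
            if (m.group(1) != null) {
                out.append(entity(m.group(1)));
            } else if (m.group(2) != null) {
                appendCodePoint(out, m.group(2), 10, m.group());
            } else if (m.group(3) != null) {
                appendCodePoint(out, m.group(3), 16, m.group());
            } else {
                out.append('"'); // doubled quote
            }
            last = m.end();
        }
        out.append(body, last, body.length());
        return out.toString();
    }

    private static char entity(String name) {
        switch (name) {
            case "lt":   return '<';
            case "gt":   return '>';
            case "amp":  return '&';
            case "quot": return '"';
            default:     return '\''; // "apos", the only remaining alternative
        }
    }

    private static void appendCodePoint(StringBuilder out, String digits, int radix, String raw) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
                return;
            }
        } catch (NumberFormatException e) {
            // Too many digits to be a codepoint; fall through.
        }
        out.append(raw); // assumption: undefined references are kept verbatim
    }
}
```

Applied to the two example literals above, valueOf is intended to return the values shown for them.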

Comments

Onyx comments may not be nested, and they are never tokenized.

Comment ::= "{--" ~"--}"
{-- This is a comment --}
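Because comments do not nest, a comment ends at the first --} after its opening {--. The sketch below reads ~"--}" as "everything up to and including the first occurrence of --}" and expresses that with a reluctant quantifier in java.util.regex. The class name OnyxComments and the method commentLength are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OnyxComments {
    // DOTALL lets the comment body span several lines; ".*?" stops at the first "--}".
    static final Pattern COMMENT = Pattern.compile("\\{--.*?--\\}", Pattern.DOTALL);

    // Returns the length of the comment starting at offset pos, or -1 if none starts there.
    static int commentLength(String input, int pos) {
        Matcher m = COMMENT.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m.end() - pos : -1;
    }
}
```

For the input {-- a {-- b --} c --} the match covers only {-- a {-- b --}, and the remaining text c --} is scanned as ordinary input.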

Whitespace

Whitespace characters are spaces, newlines, carriage returns, and tabs, i.e.  the character class [ \n\r\t].

WS ::= [ \n\r\t]*


WS is never tokenized as such, though whitespace characters can appear in the value of some tokens (such as string literals, etc.).

Lexical errors and EOF

In each lexical state, only certain patterns can be matched.  If in a state none of the patterns match the input, it is a lexical error.  Each state must detect this condition and return a LEXICALERROR token whose value is the offending character,  and whose line and column location is the location of that character in the input.

In addition, the scanner should return an EOF token upon reaching the end of the input file. The EOF token does not match any characters in the input file,  but is important later as parser input and must be tokenized.  Its line and column numbers should be reported as -1 and it has null value.

Example(s):

Input: "alfa""
Result: <?xml version="1.0" encoding="UTF-8"?>
<OnyxSource filename="danglingquot.onyx">
   <token column="1" id="45" line="1">alfa</token>
   <token column="7" id="64" line="1">"</token>
   <token column="-1" id="0" line="-1"/>
</OnyxSource>

In the example shown above, the character sequence "alfa" is scanned as a string literal, as expected from the definition of string literal. The final quote (") results in a lexical error since no pattern in any defined state matches a lone quote.
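The behaviour in this example can be reproduced with a deliberately partial sketch that knows only whitespace and string literals and treats every other character as a lexical error. The names MiniScanner, Token, and scan are hypothetical. The integer IDs are copied from the example output above; the real values are defined in onyx_sym.java. Entity and character references are not decoded here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MiniScanner {
    // IDs as they appear in the example output; onyx_sym.java is the authority.
    static final int EOF = 0, STRINGLITERAL = 45, LEXICALERROR = 64;

    static final class Token {
        final int id, line, column;
        final String value; // null for EOF
        Token(int id, int line, int column, String value) {
            this.id = id; this.line = line; this.column = column; this.value = value;
        }
    }

    private static final Pattern STRING = Pattern.compile("\"(?:\"\"|[^\"])*\"");

    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = STRING.matcher(input);
        int pos = 0, line = 1, col = 1; // lines and columns are 1-based, as in the example
        while (pos < input.length()) {
            char c = input.charAt(pos);
            int end;
            if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                end = pos + 1; // whitespace is skipped, never tokenized
            } else if (m.region(pos, input.length()).lookingAt()) {
                end = m.end();
                String value = input.substring(pos + 1, end - 1).replace("\"\"", "\"");
                tokens.add(new Token(STRINGLITERAL, line, col, value));
            } else {
                end = pos + 1; // no pattern matches: report the offending character
                tokens.add(new Token(LEXICALERROR, line, col, String.valueOf(c)));
            }
            for (; pos < end; pos++) { // advance the location over the consumed text
                if (input.charAt(pos) == '\n') { line++; col = 1; } else { col++; }
            }
        }
        tokens.add(new Token(EOF, -1, -1, null)); // EOF has location -1 and null value
        return tokens;
    }
}
```

Scanning the input "alfa"" with this sketch produces a string literal at column 1 with value alfa, a lexical error at column 7 whose value is the lone quote, and the EOF token.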