| Date | Description |
| 1/17/07 | Slight correction to the rule for comments [SBB] |
The scanner for Assignment 1 should recognize the following lexical tokens for the Onyx language. Each token has an associated integer ID as defined in the file onyx_sym.java. Although the name of a final variable defined in that file may differ somewhat from its corresponding name below, the equivalence should be sufficiently clear. Also note that the regular expression syntax used here may differ from that used by any particular scanner generator tool.
Tokens are broadly grouped into the following categories:
| Keywords | Punctuation and Operator Symbols | QNames, Variables, and Literals |
| Comments | Whitespace | Lexical Errors and EOF |
The scanner should appropriately recognize the following Onyx keywords:
| declare | function | variable | as |
| if | then | else | for |
| let | where | return | in |
| to | except | before | after |
| ordered-by | descending | ascending | intersect |
| union | le | lt | ge |
| gt | eq | ne | and |
| or | div | idiv | mod |
Onyx is case-sensitive, and there is only one way to spell each of the keywords, so there is only one possible lexeme for each keyword token. Therefore, these keyword tokens are completely determined by their integer ID, and are considered to have a null value. The location of a keyword token is the location of the first character in the matched lexeme.
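As a rough illustration of the case-sensitivity rule, the sketch below checks a scanned lexeme against the keyword list. The class name `OnyxKeywords` and the method `isKeyword` are hypothetical and do not come from the assignment files; a generated scanner would normally encode the keywords as patterns instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of the assignment files): keyword membership test. */
final class OnyxKeywords {
    // The 32 keywords listed in this section, spelled exactly as the scanner must match them.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "declare", "function", "variable", "as", "if", "then", "else", "for",
        "let", "where", "return", "in", "to", "except", "before", "after",
        "ordered-by", "descending", "ascending", "intersect", "union",
        "le", "lt", "ge", "gt", "eq", "ne", "and", "or", "div", "idiv", "mod"));

    private OnyxKeywords() {}

    /** Case-sensitive test: "return" is a keyword, "Return" and "RETURN" are not. */
    static boolean isKeyword(String lexeme) {
        return KEYWORDS.contains(lexeme);
    }

    public static void main(String[] args) {
        System.out.println(isKeyword("ordered-by"));
        System.out.println(isKeyword("Return"));
    }
}
```

Because each keyword has exactly one spelling, a token built from a successful lookup needs only its integer ID and location; its value can stay null.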
The scanner should recognize the following Onyx punctuation symbols and symbolic operators:
| / | Slash | // | Slash Slash |
| = | Equals | != | Not Equals |
| <= | Less Than or Equals | >= | Greater Than or Equals |
| < | Less Than | > | Greater Than |
| := | Colon Equals | * | Mult |
| - | Minus | + | Plus |
| | | Vertical Bar | , | Comma |
| ( | Left Parenthesis | ) | Right Parenthesis |
| [ | Left Square Bracket | ] | Right Square Bracket |
| { | Left Curly Brace | } | Right Curly Brace |
| ; | Semicolon |
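Several of these symbols are prefixes of others (for example / and //, or < and <=). The sketch below assumes the longest-match rule that scanner generators conventionally apply, and shows one way to pick the right symbol at a given input position. The class `OnyxSymbols` and the method `matchAt` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the assignment files): longest-match symbol lookup. */
final class OnyxSymbols {
    // Two-character symbols are listed first so that "//" is never read as two "/" tokens.
    private static final List<String> SYMBOLS = Arrays.asList(
        "//", "!=", "<=", ">=", ":=",
        "/", "=", "<", ">", "*", "-", "+", "|", ",",
        "(", ")", "[", "]", "{", "}", ";");

    private OnyxSymbols() {}

    /** Returns the longest symbol that starts at index pos, or null if none does. */
    static String matchAt(String input, int pos) {
        for (String symbol : SYMBOLS) {
            if (input.startsWith(symbol, pos)) {
                return symbol;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(matchAt("a//b", 1));
        System.out.println(matchAt("x := 1", 2));
        System.out.println(matchAt("!x", 0)); // "!" alone is not in the table
    }
}
```

Note that ! and : are not symbols on their own, so under this reading an isolated occurrence of either would fall through to the lexical error handling described later.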
The scanner should recognize the following tokens. Unlike keywords and punctuation, each instance of these tokens has a nontrivial string value, based on the specific matching lexeme, in addition to its token ID.
A QName in Onyx is an identifier name. It differs from an XML qualified name in that namespace specifiers are not supported.
The value of a QName token is the matched lexeme. The
location for a QName token is the line and column of the first
character in the matched lexeme.
The value of a VariableName token is the matched lexeme. The location for a VariableName token is the line and column of the first character in the matched lexeme, i.e. the $ character.
Onyx strings are delimited with quotation marks ("). A quotation mark can be embedded in a string by doubling it. The character sequence Quote Quote is recognized as an embedded quotation mark only if it occurs inside a terminated string.
Predefined entity references and Unicode character
references can appear within a string, and when constructing the value
of the string, must
be interpreted as referring to the appropriate character.
The five predefined entity references are:
| ENTITY REFERENCE | REFERS TO |
| "&lt;" | < |
| "&gt;" | > |
| "&amp;" | & |
| "&quot;" | " |
| "&apos;" | ' |
Patterns for Unicode character references are (HexDigit is the character class [0-9a-fA-F] ):
| CHARACTER REFERENCE PATTERN | REFERS TO |
| "&#" Digit+ ";" | The Unicode character with the given decimal codepoint value, if it exists; else undefined |
| "&#x" HexDigit+ ";" | The Unicode character with the given hexadecimal codepoint value, if it exists; else undefined |
The value for a string literal token is the sequence of characters that is intended to be represented by the string literal. That is, the value omits the enclosing delimiters, replaces each doubled delimiter with a single quotation mark, and replaces each entity or character reference with the character it refers to.
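The value computation can be sketched as follows. This is one possible reading rather than the reference implementation: the class `OnyxStringValue` and its methods are hypothetical, the entity names are the five standard XML ones (lt, gt, amp, quot, apos), and copying an unrecognized & through unchanged is an assumption, since this section does not say how such input is handled.

```java
/** Hypothetical helper (not part of the assignment files): string literal value. */
final class OnyxStringValue {
    private OnyxStringValue() {}

    /** Computes the value of a complete string literal lexeme, delimiters included. */
    static String decode(String lexeme) {
        // Drop the opening and closing quotation marks.
        String body = lexeme.substring(1, lexeme.length() - 1);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c == '"' && i + 1 < body.length() && body.charAt(i + 1) == '"') {
                out.append('"'); // a doubled quote stands for one quote
                i += 2;
                continue;
            }
            if (c == '&') {
                int semi = body.indexOf(';', i);
                String resolved = semi < 0 ? null : resolve(body.substring(i + 1, semi));
                if (resolved != null) {
                    out.append(resolved);
                    i = semi + 1;
                    continue;
                }
                // Assumption: an unrecognized '&' is copied through unchanged.
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Maps the text between '&' and ';' to its replacement, or null if unknown. */
    private static String resolve(String name) {
        switch (name) {
            case "lt": return "<";
            case "gt": return ">";
            case "amp": return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default: break;
        }
        try {
            int codePoint;
            if (name.matches("#x[0-9a-fA-F]+")) {
                codePoint = Integer.parseInt(name.substring(2), 16);
            } else if (name.matches("#[0-9]+")) {
                codePoint = Integer.parseInt(name.substring(1), 10);
            } else {
                return null;
            }
            return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : null;
        } catch (NumberFormatException overflow) {
            return null; // digit run too large to be any codepoint
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("\"say \"\"hi\"\" &amp; wave\""));
    }
}
```

The location of the token is unaffected by this decoding; it remains the position of the opening quotation mark in the input.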
The line and column location for a string literal token is the
location of the quote that begins the literal. For
example, the string literal
has value
and its location is the location of the quote that begins the
literal (and which doesn't appear in the value). (Most browsers
will display that line correctly; the character reference ê
refers to the Unicode character French e-circonflex.)
Onyx comments may not be nested, and they are never tokenized.
Whitespace characters are spaces,
newlines, carriage
returns, and tabs, i.e. the character class [ \n\r\t].
WS is never tokenized as such,
though whitespace characters can appear in the value of some tokens
(such as string literals, etc.).
In each lexical state, only certain patterns can be matched.
If in a state none of the patterns match the input, it is a
lexical error. Each state must detect this condition and return a
LEXICALERROR token whose value is the offending character, and
whose line and column location is the location of that character in the
input.
In addition, the scanner should return an EOF token upon reaching the end of the input file. The EOF token does not match any characters in the input file, but is important later as parser input and must be tokenized. Its line and column numbers should be reported as -1 and it has null value.
Example(s):
| Input: | "alfa"" |
| Result: | <?xml version="1.0" encoding="UTF-8"?> <OnyxSource filename="danglingquot.onyx"> <token column="1" id="45" line="1">alfa</token> <token column="7" id="64" line="1">"</token> <token column="-1" id="0" line="-1"/> </OnyxSource> |
In the example shown above, the character sequence "alfa" is scanned as a string literal, as expected from the definition of a string literal. The final quote (") results in a lexical error, since there is no match in any defined state for a lone quote.
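The behaviour in this example can be reproduced with a small hand-written sketch. It is deliberately incomplete: it recognizes only string literals and spaces or tabs on a single input line, reports every other character as a lexical error, and does not resolve entity or character references. The token IDs 45, 64, and 0 are taken from the example output; the class, constant, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo (not part of the assignment files): the dangling quote case. */
final class DanglingQuoteDemo {
    // IDs copied from the example output; the constant names are made up.
    static final int EOF = 0;
    static final int STRING_LITERAL = 45;
    static final int LEXICAL_ERROR = 64;

    // Terminated string: a quote, any run of non-quotes or doubled quotes, a quote.
    private static final Pattern STRING = Pattern.compile("\"(?:[^\"]|\"\")*\"");

    static final class Token {
        final int id;
        final String value;
        final int line;
        final int column;

        Token(int id, String value, int line, int column) {
            this.id = id;
            this.value = value;
            this.line = line;
            this.column = column;
        }

        @Override
        public String toString() {
            return id + "@" + line + ":" + column + "=" + value;
        }
    }

    /** Scans a single-line input; assumes line 1 and 1-based columns. */
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = STRING.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (c == ' ' || c == '\t') {
                pos++; // whitespace is never tokenized
                continue;
            }
            m.region(pos, input.length());
            if (m.lookingAt()) {
                // Strip the delimiters and collapse doubled quotes.
                String value = input.substring(pos + 1, m.end() - 1).replace("\"\"", "\"");
                tokens.add(new Token(STRING_LITERAL, value, 1, pos + 1));
                pos = m.end();
            } else {
                // No pattern matches here: report the offending character.
                tokens.add(new Token(LEXICAL_ERROR, String.valueOf(c), 1, pos + 1));
                pos++;
            }
        }
        tokens.add(new Token(EOF, null, -1, -1)); // EOF matches no input characters
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : scan("\"alfa\"\"")) {
            System.out.println(t);
        }
    }
}
```

On the input from the example, the regular expression first tries to treat the trailing pair of quotes as an embedded quote, finds no terminator, and backs off to the shorter match "alfa". The leftover quote in column 7 then matches nothing and becomes the lexical error token.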