Onyx XML/DOM Appendix

OXML                               ODOM
Description                        Features
Document Structure / Node Types
XML Document Conversion
Entity References                  API Specification / Javadoc

 


Changelog

Date         Description
10/27/2006   Initial version of the document



Onyx XML

Description

Onyx XML, hereafter OXML, is a subset of the Extensible Markup Language (XML) as documented and standardized by the W3C. XML provides a wide range of features, parameters, and structures which can be used both for persistent data and for serialized data representation in network communication. Another key area in which XML is widely used is as a form of storage for databases. In the spirit of a kernel language and a move toward simplicity, OXML restricts the XML specification to the key functionality needed to specify, store, and query flat-file OXML databases.

The following terminology is used to describe the parts of an OXML document in both the ODOM API and the remainder of this document:

OXML <tagname attributeName="attributeValue" type="OnyxTextNode">content of a text node</tagname>
<tagname attributeName="attributeValue" type="OnyxElement">
   <contentOfAnElementNode/>
</tagname>

The OXML fragment above can be described in the following way. The first element is a node whose tagname is "tagname" and which has two attributes, one named "attributeName" and the other named "type". The attribute named "attributeName" has the value "attributeValue", and the attribute named "type" has the value "OnyxTextNode". The content of the node is the string "content of a text node". Note that the names and values described do not include the enclosing quotation marks. The second element is a node with the tagname "tagname" and the same two attributes as the first node, except that the value of the attribute named "type" is "OnyxElement". The content of the second node is a node whose tagname is "contentOfAnElementNode" and which has no attributes. For any given node, the set of attributes it has is collectively referred to as its attribute environment. More specifically, an attribute environment is a set of key-value pairs in which all the keys are unique.

One of the key differences between XML and OXML is the restriction on the content of any given node. In OXML, the content of a node consists either entirely of text or entirely of other nodes. That is to say, OXML does not permit mixed content. XML node types such as comments, character data, and processing instructions are not supported by OXML, with the exception of the XML declaration "<?xml version="1.0" encoding="UTF-8"?>", which, when present, appears at the very top of an XML file. Suffice it to say that an OXML document is always valid XML, whereas not all XML documents are valid OXML documents.

Document Structure / Node Types

Due to the content restriction on every OXML node, nodes are classified into two types: OnyxElement (eNode) and OnyxTextNode (tNode). An OnyxElement is a node which contains only other nodes. An OnyxTextNode is a node that contains only text. An OXML document contains a single top-level node, which may be either a tNode or an eNode.
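As a rough illustration of the two node types, the following Python sketch models them as plain data classes. The class and field names (TNode, ENode, tagname, attributes, text, children) are hypothetical and chosen only for this example; they are not taken from the ODOM API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class TNode:
    """Hypothetical model of an OnyxTextNode: a node that holds only text."""
    tagname: str
    attributes: Dict[str, str] = field(default_factory=dict)
    text: str = ""


@dataclass
class ENode:
    """Hypothetical model of an OnyxElement: a node that holds only nodes."""
    tagname: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


Node = Union[TNode, ENode]

# The two elements from the terminology example in the Description section.
text_node = TNode(
    "tagname",
    {"attributeName": "attributeValue", "type": "OnyxTextNode"},
    "content of a text node",
)
element_node = ENode(
    "tagname",
    {"attributeName": "attributeValue", "type": "OnyxElement"},
    [ENode("contentOfAnElementNode")],
)
```

Because each class has either a text field or a children field but never both, mixed content cannot be represented in this model.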

XML Document Conversion

Onyx provides the capability of reading in an XML document on which to run queries. Since XML documents may not be OXML compliant, Onyx uses a set of rules to produce a compliant OXML document. In performing the conversion, every node, along with each of its children, is checked and classified as either a tNode or an eNode based on its content. An XML node with no content, which has the form <tagname/>, is converted to an eNode with no children, as is an XML node of the form <tagname></tagname>.

A node whose first non-whitespace content is another node is classified as an eNode. Once classified as an eNode, all children of the node that are not other nodes are removed. To illustrate this conversion, as well as the meaning of non-whitespace content, see the following example:

XML <a>   <b>a string</b><c/> some random text </a>
OXML <a><b>a string</b><c/></a>

In the above example, the node with the tagname "a" contains mixed content of nodes (b, c) and text (some random text). According to the rule, the first non-whitespace content is the node "b"; thus "a" is classified as an eNode and all non-node children are removed, hence the removal of the string " some random text ". One thing to note is that the standard XML DOM classifies the whitespace between the start tag "a" and the start tag "b" as text content containing spaces. Thus, when doing the conversion, such leading whitespace is stripped away and not considered when performing the classification.

XML <a>   some text <b>a string</b><c/> </a>
OXML <a>some text</a>

A node whose first non-whitespace content is text is classified as a tNode. A node whose only content is whitespace is also a tNode. All children occurring after the detected text are then stripped away. The example shown above illustrates this. The string "some text" is the first content after the initial whitespace. Thus the node "a" is classified as a tNode and all other children are removed.
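The classification rules in this section can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not the Onyx implementation: the function names classify and to_oxml are hypothetical, trailing whitespace of a tNode's text is trimmed because the example output shows "some text" without it, and whitespace-only text is kept unchanged because the rules do not say what happens to it.

```python
import xml.etree.ElementTree as ET


def classify(node):
    """Return "tNode" or "eNode" for an ElementTree element."""
    text = node.text or ""  # in ElementTree, text before the first child
    if text.strip():
        return "tNode"  # first non-whitespace content is text
    if len(node) > 0:
        return "eNode"  # first non-whitespace content is a child node
    # No children: whitespace-only content is a tNode,
    # no content at all (<a/> or <a></a>) is an eNode.
    return "tNode" if text else "eNode"


def to_oxml(node):
    """Build a new element tree that follows the conversion rules."""
    result = ET.Element(node.tag, dict(node.attrib))
    text = node.text or ""
    if classify(node) == "tNode":
        # Keep only the text; every child after it is dropped.
        # Assumption: trim surrounding whitespace, keep whitespace-only text.
        result.text = text.strip() or text
    else:
        # Keep only child nodes. Text between them (node.text and each
        # child's tail in ElementTree) is simply not copied.
        for child in node:
            result.append(to_oxml(child))
    return result


first = ET.fromstring("<a>   <b>a string</b><c/> some random text </a>")
second = ET.fromstring("<a>   some text <b>a string</b><c/> </a>")
print(ET.tostring(to_oxml(first), encoding="unicode"))
print(ET.tostring(to_oxml(second), encoding="unicode"))
```

ElementTree writes an empty element as <c />, which is equivalent to <c/>. Its default parser also discards comments and processing instructions, so the sketch does not need to remove them itself.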

Attributes are considered to be the same in both XML and OXML. If the attribute environment of some XML node contains two or more identical keys, only the last one in the environment definition is considered. This behavior is consistent with the OXML notion of replacing the value of an already existing attribute.
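The last-one-wins rule for duplicate keys can be sketched as follows. The input format (a list of key/value pairs in document order) and the function name attribute_environment are assumptions made for this example. A standard XML parser will typically reject duplicate attributes before any such step, so a real implementation would need a more lenient reader.

```python
def attribute_environment(pairs):
    """Build an attribute environment from (key, value) pairs in document order."""
    environment = {}
    for key, value in pairs:
        # A later duplicate replaces the value stored for the same key.
        environment[key] = value
    return environment


pairs = [("id", "1"), ("type", "OnyxTextNode"), ("id", "2")]
env = attribute_environment(pairs)
```

Here env holds a single "id" entry with the value "2", so the keys stay unique as the definition of an attribute environment requires.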

Entity References

When processing an Onyx program, all string literals are processed so that entity references are replaced with their equivalent character representation. The two tables below show the reference conversions that must be done when processing any Onyx string.

Entity Reference   Refers To
"&quot;"           "
"&apos;"           '
"&amp;"            &
"&gt;"             >
"&lt;"             <

Character Reference Refers To
"&#" [digit]+ ";" The Unicode character with the given decimal codepoint value, if it exists, else undefined.
"&#x" 4([HexDigit]) ";" The Unicode character with the given hexadecimal codepoint value, if it exists, else undefined.

Thus the value of a string literal does not contain any references. Since XML does not allow certain characters, all illegal characters must be converted back to their entity references before being displayed as XML content. According to the XML standard, the text content of a node may contain the " and ' characters but should not contain the <, >, and & characters. Thus, these three characters must be converted back to their entity reference equivalents. In addition to these three restrictions, attribute values may not contain the " and ' characters either. In short, all five entity references must be converted for attribute values and only three for text content. This conversion is handled automatically by the onyx_xml ODOM package.
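Both directions of this conversion can be sketched in Python. The function names below are hypothetical and are not the onyx_xml API. Two assumptions are made: the hexadecimal form is read as exactly four hex digits, following the notation in the table above, and a character reference with no corresponding Unicode character is left unchanged, since the tables leave that case undefined.

```python
import re

NAMED = {"quot": '"', "apos": "'", "amp": "&", "gt": ">", "lt": "<"}

# One pattern covers all three forms, so every reference is replaced in a
# single pass and decoded output is never decoded a second time.
REFERENCE = re.compile(r"&(?:(quot|apos|amp|gt|lt)|#([0-9]+)|#x([0-9A-Fa-f]{4}));")


def decode_references(literal):
    """Replace entity and character references with the characters they name."""
    def replace(match):
        name, decimal, hexadecimal = match.groups()
        if name:
            return NAMED[name]
        codepoint = int(decimal) if decimal else int(hexadecimal, 16)
        try:
            return chr(codepoint)
        except (ValueError, OverflowError):
            return match.group(0)  # assumption: leave undefined references as-is
    return REFERENCE.sub(replace, literal)


# Three characters are escaped in text content, all five in attribute values.
TEXT_ESCAPES = {ord("&"): "&amp;", ord("<"): "&lt;", ord(">"): "&gt;"}
ATTRIBUTE_ESCAPES = {**TEXT_ESCAPES, ord('"'): "&quot;", ord("'"): "&apos;"}


def escape_text(value):
    """Escape a string for use as the text content of a node."""
    return value.translate(TEXT_ESCAPES)


def escape_attribute(value):
    """Escape a string for use as an attribute value."""
    return value.translate(ATTRIBUTE_ESCAPES)
```

Using str.translate escapes each character exactly once, which avoids the ordering problem of chained replacements where an already produced "&" would be escaped again.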