Extensible Markup Language (XML) is intended to be a universal format for structured data. It provides a text-oriented, tree-structured format that, with suitable conventions, can be used to represent many kinds of complex data structures. Its ease of processing and its flexibility have made XML popular for many applications.
An important virtue of XML is its simple syntax. (Rumor has it that one committee goal was that an experienced Perl hacker might build an XML parser in a weekend.) In well-formed XML documents, the hierarchical structure is readily apparent without knowledge of the data.
In this class we will be using a subset of XML called Onyx XML (OXML). Since OXML is a subset of XML, OXML documents are also well-formed XML. An OXML document represents data as a structured collection of elements. The content of an element must be either other elements or text, but not a mixture of both. Elements are delimited by tags. Tags are in turn delimited by angle brackets ('<', '>') and must follow precise rules on structure and nesting. Text also follows strict rules, in particular a prohibition on the use of the tag delimiter characters (see the discussion of character entities below).
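The rule against mixed content can be checked mechanically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name has_mixed_content is hypothetical, and treating whitespace-only text between child elements as ignorable is an assumption, not something the OXML rules above state.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(element):
    """Return True if any element in the subtree mixes text with child elements."""
    for node in element.iter():
        children = list(node)
        if not children:
            continue  # text-only or empty element: never mixed
        # Text may appear before the first child (node.text)
        # or directly after any child (child.tail).
        pieces = [node.text] + [child.tail for child in children]
        # Assumption: whitespace-only text between children is ignored.
        if any(piece and piece.strip() for piece in pieces):
            return True
    return False

ok = ET.fromstring("<doc><title>CSE 131A</title><break/></doc>")
mixed = ET.fromstring("<body>Welcome<br/>to class</body>")
print(has_mixed_content(ok))     # False
print(has_mixed_content(mixed))  # True
```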
XML and OXML look very similar to HTML (HyperText Markup Language), the web page markup language you may be familiar with. (Both languages have a common ancestor in SGML, the Standard Generalized Markup Language.) Here are examples of HTML and OXML documents:
| HTML | OXML |
|---|---|
| 1 <HTML> | 1 <?xml version="1.0" encoding="utf-8"?> |
| ... | ... |
| 8 Welcome to CSE 131A! | 8 <break/> |
Although these two texts are similar, there are important differences. Each well-formed OXML file must begin with an XML declaration (OXML line 1). The HTML tag names are defined in the HTML standard and have an intended semantics in terms of page display, but the XML tag names are arbitrary. There are no predefined tag names in the XML standard as such, and no predefined semantics of tags, for purposes of formatting or anything else. The ability to define new tags and semantics for those tags (subject to the basic constraints of XML syntax) is what makes XML extensible.
One significant syntactic difference is shown between line 7 of the HTML sample and line 8 of the OXML sample. Compare the <BR> HTML element to the <break/> OXML element. The terminating /> indicates that the tag has no embedded content. Unlike HTML, in XML embedded content is inferred strictly from the syntax, not the element name. The HTML sample is not well-formed OXML for three reasons: 1) the <BR> tag is not closed, 2) the body element has mixed content, and 3) there is no XML declaration. (At this time, HTML is in the process of being superseded by the XHTML standard, which reformulates HTML as well-formed XML.)
In XML, there are three forms of tag: start-tags, end-tags, and empty-element tags. Start-tags and end-tags are used to delimit nested data, while empty-element tags define self-contained data. Each tag form has a unique syntax:
| Tag Form | Example | Notes |
|---|---|---|
| start-tag | <classroom> | No closing syntax (/> or </) in tag. |
| end-tag | </classroom> | Tag starts with a closing syntax (</). |
| empty-element-tag | <break/> | Tag ends with a closing syntax (/>). |
The tag name (e.g. classroom) must immediately follow the tag start (< or </). No intervening space is allowed. Spaces are permitted before the tag terminator (> or />).
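The three tag forms and the spacing rules can be captured with regular expressions. This is only a sketch: the NAME pattern is a simplification (ASCII letters, digits, '_', '.', '-') rather than the full XML name rule, and classify_tag is a hypothetical helper name.

```python
import re

# Simplified name pattern; the real XML Name production allows more characters.
NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
# One attribute: whitespace, name, '=', then a quoted value.
ATTR = rf"""\s+{NAME}\s*=\s*(?:"[^"<]*"|'[^'<]*')"""

START_TAG = re.compile(rf"<{NAME}(?:{ATTR})*\s*>")
END_TAG = re.compile(rf"</{NAME}\s*>")
EMPTY_TAG = re.compile(rf"<{NAME}(?:{ATTR})*\s*/>")

def classify_tag(tag):
    """Return the tag form, or None if the string is not a valid tag."""
    if EMPTY_TAG.fullmatch(tag):
        return "empty-element-tag"
    if END_TAG.fullmatch(tag):
        return "end-tag"
    if START_TAG.fullmatch(tag):
        return "start-tag"
    return None

for sample in ["<classroom>", "</classroom>", "<break/>", "<break />", "< classroom>"]:
    print(repr(sample), "->", classify_tag(sample))
```

The last sample is rejected because a space follows the opening <, while "<break />" is accepted because the space precedes the terminator.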
Any elements or text between a matching start-tag and end-tag are treated as the content of the element delimited by those tags. In the OXML example above, the text "CSE 131A" is the content of the title element. The tags and text between <doc> and </doc> constitute the content of the doc element. The <break/> tag is an empty-element tag.
Start-tags and empty elements may also include any number of attributes. Attributes define a name/value binding that is associated with the tag. Each attribute may only occur in a given tag at most once. Attribute values are always quoted strings. Here are some examples of tags with attributes and text content:
<element symbol="H" number="1">Hydrogen</element>
<element symbol="Fe" number="56">Iron</element>
<molecule formula="H2O">Water</molecule>
<molecule formula="H2O2">Hydrogen Peroxide</molecule>
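To see how a parser exposes these name/value bindings, here is a small sketch using Python's standard xml.etree.ElementTree. The wrapping <elements> root and the helper is_well_formed are hypothetical, added only so the fragment parses as a single document.

```python
import xml.etree.ElementTree as ET

# The <elements> wrapper is an assumption: a document needs exactly one root.
doc = """<elements>
  <element symbol="H" number="1">Hydrogen</element>
  <molecule formula="H2O">Water</molecule>
</elements>"""

root = ET.fromstring(doc)
for child in root:
    # .attrib is a dict of attribute names to (string) values; .text is the content.
    print(child.tag, child.attrib, child.text)

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# An attribute may occur at most once in a tag, so the parser rejects this.
print(is_well_formed('<element symbol="H" symbol="He"/>'))
```

Note that attribute values always come back as strings, so number="1" yields "1" rather than an integer.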
Text content does not add structure, but constitutes the basic content of the document. Text in XML documents consists of a sequence of characters, with some restrictions. The key restriction is a prohibition against the use of the XML markup characters in text: "<", ">", and "&". These are reserved to demarcate XML tags or character entities that may occur among the text. Character entities are a notation used in text to represent the prohibited characters. XML defines the character entities &lt; &gt; &amp; for this purpose, and also provides &apos; and &quot; which can be convenient for representing single-quote and double-quote characters in strings.
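As an illustration of how text is converted to and from this notation, the sketch below uses the escape and unescape helpers from Python's standard xml.sax.saxutils module. The sample strings are made up for the example.

```python
from xml.sax.saxutils import escape, unescape

raw = "x < y & y > z"

# By default, escape() replaces &, < and > with their character entities.
encoded = escape(raw)
print(encoded)  # x &lt; y &amp; y &gt; z

# Quote characters are only replaced when an explicit mapping is supplied.
quoted = escape('say "hi"', {'"': "&quot;"})
print(quoted)  # say &quot;hi&quot;

# unescape() reverses the default replacements.
decoded = unescape(encoded)
print(decoded == raw)  # True
```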
The syntax of an OXML document is simple, and can be defined quite easily using a BNF-like context-free grammar (some slight simplification here compared to the full XML standard):
Document ::= Prolog Element
Element ::= EmptyElementTag | StartTag Content EndTag
Content ::= TextData | Element*
TextData ::= [^<&]*
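A grammar this small maps directly onto a recursive-descent recognizer, with one function per production. The following is a sketch under several assumptions: the tag patterns are simplified, whitespace between child elements is skipped, character entities are not handled (the TextData production excludes "&"), and all function names are hypothetical. It also checks that end-tag names match start-tag names.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"                      # simplified name rule
PROLOG = re.compile(r"<\?xml\b[^?]*\?>\s*")              # XML declaration only
OPEN = re.compile(rf"<({NAME})(?:\s[^<>]*?)?(/?)>")      # start-tag or empty-element tag
END = re.compile(rf"</({NAME})\s*>")
TEXT_DATA = re.compile(r"[^<&]*")
WS = re.compile(r"\s*")

def parse_document(text):
    """Document ::= Prolog Element"""
    prolog = PROLOG.match(text)
    if prolog is None:
        raise ValueError("missing XML declaration")
    tree, pos = parse_element(text, prolog.end())
    if text[pos:].strip():
        raise ValueError("unexpected content after the root element")
    return tree

def parse_element(text, pos):
    """Element ::= EmptyElementTag | StartTag Content EndTag"""
    tag = OPEN.match(text, pos)
    if tag is None:
        raise ValueError(f"expected a tag at offset {pos}")
    name, pos = tag.group(1), tag.end()
    if tag.group(2):                       # trailing '/': empty-element tag
        return (name, []), pos
    content, pos = parse_content(text, pos)
    end = END.match(text, pos)
    if end is None:
        raise ValueError(f"expected an end-tag at offset {pos}")
    if end.group(1) != name:
        raise ValueError(f"end-tag </{end.group(1)}> does not match <{name}>")
    return (name, content), end.end()

def parse_content(text, pos):
    """Content ::= TextData | Element*"""
    data = TEXT_DATA.match(text, pos)
    pos = data.end()
    if text.startswith("</", pos):         # text only (possibly empty)
        return ([data.group()] if data.group() else []), pos
    if data.group().strip():
        raise ValueError("mixed content is not allowed")
    children = []
    while pos < len(text) and not text.startswith("</", pos):
        child, pos = parse_element(text, pos)
        children.append(child)
        pos = WS.match(text, pos).end()    # assumption: skip whitespace between children
    return children, pos

sample = '<?xml version="1.0" encoding="utf-8"?>\n<doc>\n  <title>CSE 131A</title>\n  <break/>\n</doc>\n'
print(parse_document(sample))
```

Each element is returned as a (name, content) pair, where content is either a list containing one text string or a list of child pairs.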
Following these BNF productions, we can see that a well-formed OXML document has the hierarchical structure of a tree. The document begins with a prolog, which contains the XML declaration and perhaps other information, but which is not considered part of the content of the document. Then there is exactly one element, called the root, or document element, at the top of the hierarchy. This root element contains all content in the document. For every other element E, if its start-tag is contained in an element F, the end-tag of E is also contained in F. Thus elements, delimited by start- and end-tags, nest within each other.
As a result of this, for each non-root element E in the document, there is exactly one other element P in the document such that E is contained in P, but E is not contained in any other element that is contained in P. This element P is the parent of E, and E is a child of P. (In the Document Object Model, text matching the TextData production above is also considered to be a node in the document tree; each text node has a unique parent but no children.) The tree relations of sibling, ancestor, and descendant are defined in the usual way. This simple hierarchical structure makes parsing and manipulating XML documents relatively straightforward.
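The parent, sibling, and ancestor relations can be computed from a parsed tree. The sketch below uses Python's standard xml.etree.ElementTree, which stores children but not parent links, so a parent map is built first. The sample document and the helper name ancestors are made up for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<doc><section><title>Intro</title><para>Text</para></section><break/></doc>"
)

# Every non-root element has exactly one parent, so a dict can record it.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(element):
    """Return the tag names of all ancestors, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain

title = root.find("./section/title")
print(ancestors(title))                      # ['section', 'doc']

# Siblings are the other children of the same parent.
siblings = [e.tag for e in parent_of[title] if e is not title]
print(siblings)                              # ['para']

# Descendants of the root, in document order (excluding the root itself).
descendants = [e.tag for e in root.iter() if e is not root]
print(descendants)                           # ['section', 'title', 'para', 'break']
```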
The full XML standard also includes comments, CDATA sections, user-defined character entities, processing instructions, namespaces, and document type declarations. These additions address common challenges in commercial applications. For example, CDATA sections can simplify support for large text elements. For this class, we will omit some of these features for simplicity.
Each of these other types of content has a simple and distinctive lexical form. This ensures that XML documents remain syntactically simple. For example, the lexical structure of processing instructions and comments is similar to that of empty-element tags: each uses angle brackets to form distinctive opening and closing patterns. In particular, XML comments use the patterns <!-- and --> to bracket the comment text:
<!-- This is an XML comment -->
An additional constraint in well-formed XML (which cannot be specified in the BNF, but can easily be handled as a semantic check in a parser) is that the name in an EndTag must match the name in the corresponding StartTag.
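One common way to implement this semantic check is with a stack of open tag names. The sketch below is deliberately rough: it scans tags with a single simplified regular expression, ignores everything else (including the XML declaration), and would be confused by a ">" inside an attribute value. The name tags_match is hypothetical.

```python
import re

# Groups: optional leading '/', the tag name, optional trailing '/'.
TAG = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_.\-]*)[^<>]*?(/?)>")

def tags_match(text):
    """Return True if every end-tag names the most recently opened start-tag."""
    stack = []
    for closing, name, empty in (m.groups() for m in TAG.finditer(text)):
        if closing:
            # An end-tag must match the innermost element that is still open.
            if not stack or stack.pop() != name:
                return False
        elif not empty:
            stack.append(name)      # start-tag: remember it until it is closed
        # Empty-element tags open and close at once, so nothing is pushed.
    return not stack                # everything opened must have been closed

print(tags_match("<doc><title>CSE 131A</title><break/></doc>"))  # True
print(tags_match("<doc><title>CSE 131A</doc></title>"))          # False
```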
The official documentation for XML is maintained and developed by the World Wide Web Consortium (W3C). This documentation takes the form of recommendations. The core recommendations for standard XML are found at the W3C web site on XML publications, located at http://www.w3.org/XML/Core/#Publications, and are:
Extensible Markup Language (XML).
The formal definition of the XML language. This W3C recommendation describes the basic syntax for XML documents.
Namespaces in XML 1.0 (2nd Ed), 16 August 2006
XML namespaces provide a simple method for qualifying XML element names and attribute names. This recommendation uses prefixes separated by colons (":") to associate element and attribute names with URI references. This is useful for avoiding unintended name sharing in large applications.
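The effect of a namespace prefix can be seen by parsing a small document with Python's standard xml.etree.ElementTree. This is a sketch only: the prefix chem and the URI http://example.org/chemistry are made up for illustration, and namespaces are part of full XML rather than something OXML is stated to support.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, bound to the prefix "chem" on the root element.
doc = """<chem:table xmlns:chem="http://example.org/chemistry">
  <chem:element symbol="H">Hydrogen</chem:element>
</chem:table>"""

root = ET.fromstring(doc)

# The parser replaces each prefix with the URI it is bound to.
print(root.tag)  # {http://example.org/chemistry}table

# Searches supply their own prefix-to-URI mapping; the prefix itself is arbitrary.
ns = {"c": "http://example.org/chemistry"}
found = root.find("c:element", ns)
print(found.text)  # Hydrogen
```

Because names are compared by URI rather than by prefix, two vocabularies that both define an element called "element" do not collide.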
XML is one of many languages derived from SGML (Standard Generalized Markup Language [ISO 8879]). XML adopts many of the lexical conventions of SGML and simplifies the language in several ways. One important carry-over from SGML is the DTD (Document Type Definition). DTDs define a required structure that an XML document must follow, and are used to validate the contents of XML documents. A more sophisticated and complex way of specifying the allowed structure of an XML document is available through XML Schema. XML Schema can impose a full, flexible type system on XML elements, and this type system is intended to be used by XQuery.
XML Schema
A standard for describing the structure of, and constraints on, XML documents. This standard allows data type definitions to be written using XML elements.
The simple syntax of the XML standard has led to a large body of pre-packaged software for processing XML files. Among the many relevant references are:
SAX: Simple API for XML
The standard API for reading XML documents. This interface uses an event-driven model to simplify application construction.
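As a rough illustration of the event-driven model, the sketch below uses Python's standard xml.sax module. The parser calls the handler's methods as it meets each start-tag and end-tag, and the application never holds the whole document in memory. The class name TagCounter and the sample document are made up for the example.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count elements by name and track the deepest nesting level seen."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.depth = 0
        self.max_depth = 0

    def startElement(self, name, attrs):
        # Called for every start-tag and every empty-element tag.
        self.counts[name] = self.counts.get(name, 0) + 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        # Called for every end-tag (and right after an empty-element tag).
        self.depth -= 1

handler = TagCounter()
xml.sax.parseString(b"<doc><title>CSE 131A</title><break/><break/></doc>", handler)
print(handler.counts)
print(handler.max_depth)
```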
DOM: Document Object Model
The DOM defines an in-memory model of an XML document. The in-memory model simplifies traversals over the logical data structure of the document.
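A short sketch of the in-memory model, using Python's standard xml.dom.minidom (one of several DOM implementations; the sample document is made up). It also shows that text is itself a node with a parent but no children.

```python
from xml.dom import minidom

dom = minidom.parseString("<doc><title>CSE 131A</title><break/></doc>")
root = dom.documentElement
print(root.tagName)                                   # doc

# Child elements are reachable by traversal from the root.
child_names = [node.tagName for node in root.childNodes]
print(child_names)                                    # ['title', 'break']

# The text inside <title> is a separate text node in the tree.
title = root.getElementsByTagName("title")[0]
text_node = title.firstChild
print(text_node.nodeType == text_node.TEXT_NODE)      # True
print(text_node.data)                                 # CSE 131A
print(text_node.parentNode.tagName)                   # title
print(text_node.hasChildNodes())                      # False
```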
XSLT: Extensible Stylesheet Language Transformations
XSLT is a language for transforming XML documents into other XML documents.
These standards typically now enjoy a choice of implementations that are available as libraries (e.g. Java 1.5 SDK) or as stand-alone applications.
In addition, there are a large number of proposals for defining semantics for standard XML tags. These proposals tend to be specific to their local domain, but some domains are very common. Notable standards for element names include these:
XHTML
A reformulation of the HTML standard to comply with the XML conventions for tag names and empty-element tags.
XLink: XML Linking Language
A standard to create and define links between resources. This standard supports references and other non-hierarchical connections between XML elements.
Some significant repositories of XML documents are freely available on the web. These may be useful in testing the performance of XQuery implementations. However, be aware that these documents may not be OXML compliant.
· GCIDE_XML, the GNU version of The Collaborative International Dictionary of English.
· National & American League 1998 Statistics
· HTML Writers Guild Project Gutenberg. Contains many classic texts.
· Open Directory RDF. RDF dumps of the Open Directory database. 100+ MB.