MicroXML -- Editor's Draft

John Cowan

2012-09-08

This version: http://www.ccil.org/~cowan/MicroXML.html

Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.

Abstract

MicroXML is a subset of XML intended for use in contexts where full XML is, or is perceived to be, too large and complex. MicroXML provides a set of rules for defining markup languages intended for use in encoding data objects, and specifies behavior for certain software modules that access them.

Status of this Document

This document is a private skunkworks and has no official standing of any kind, not having been reviewed by any organization in any way.

This draft was edited by John Cowan from bits and pieces of W3C and non-W3C documents edited by himself, James Clark, Tim Bray, Dave Hollander, Andrew Layman, Eve Maler, Jonathan Marsh, Jean Paoli, C. Michael Sperberg-McQueen, and Richard Tobin. There should be no suggestion that anybody other than John Cowan approves of the content or even the existence of the present document.

The copyright statement above applies to much of the text assembled for this document, but should not be taken as an indication that the W3C approves of the contents or existence of this document.

1 Introduction

MicroXML describes a class of data objects called MicroXML documents, or just documents, provides a data model for them, and partially describes the behavior of computer programs which process them. By construction, MicroXML documents are well-formed XML 1.0 (Fifth Edition) documents.

The creation of an XML subset can be justified even though the costs of XML complexity have already been paid, for at least the following reasons:

The goals of MicroXML are as follows:

  1. The syntax of MicroXML shall be a subset of XML 1.0.
  2. MicroXML shall specify a data model and a mapping from the syntax to the data model, which shall be substantially consistent with XML 1.0.
  3. MicroXML shall be dramatically simpler than XML as regards its specification, syntax and data model.
  4. MicroXML shall be designed to complement rather than replace XML, JSON and HTML.
  5. MicroXML shall support the needs of documents, in particular mixed content.
  6. MicroXML shall support Unicode.
  7. MicroXML shall support the use of text editors for authoring.
  8. MicroXML shall be able to straightforwardly represent HTML content.
  9. The specification of MicroXML shall be as self-contained as is practical.

MicroXML documents are made up of characters, some of which form character data, and some of which form markup. Markup primarily encodes a description of the document's logical structure.

A software module called a MicroXML processor is used to read MicroXML documents and provide access to their content and structure. It is assumed that a MicroXML processor is doing its work on behalf of another module called the application. This specification describes the behavior of a MicroXML processor in terms of how it MUST process MicroXML documents and what information it MUST, SHOULD, and MAY provide to the application.

This specification, together with [RFC 2119] for requirement keywords and [Unicode] for characters, provides all the information necessary to understand MicroXML and construct computer programs to process it.

This version of the MicroXML specification can be distributed freely, as long as all text and legal notices remain intact.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2 Syntax

A sequence of characters is a MicroXML document if, taken as a whole, it matches the production labeled "[1] document" and meets the further constraints found in the text of this specification marked with the keywords MUST or REQUIRED.

2.1 Documents

[1] document ::= (comment | pi | s)* element (comment | s)*

Each document contains a single element called the root element, plus OPTIONAL comments and whitespace before and after it, plus OPTIONAL processing instructions before (but not after) it.

Here is a simple example of a document whose root element is named greeting:

<greeting><w>Hello</w> <w>world</w>!</greeting>
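Because MicroXML documents are by construction well-formed XML, an ordinary XML parser can read this example. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. It is only an illustration: ElementTree is a general XML parser, not a conforming MicroXML processor, and it accepts many documents that MicroXML forbids.

```python
import xml.etree.ElementTree as ET

# The example document from this section.
doc = "<greeting><w>Hello</w> <w>world</w>!</greeting>"

# Parse the document; the result is the root element.
root = ET.fromstring(doc)

# Names of the root element and of its child elements.
root_name = root.tag
child_names = [child.tag for child in root]

# All character data in document order, ignoring markup.
all_text = "".join(root.itertext())

print(root_name, child_names, all_text)
```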

2.2 Elements

[4] element ::= startTag content endTag
              | emptyElementTag
[5] content ::= (element | comment | pi | dataChar | charRef)*
[6] startTag ::= '<' name (s+ attribute)* s* '>'
[7] emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
[8] endTag ::= '</' name s* '>'

Elements are the basic building blocks of documents. An element MAY contain a span of text called its content. The boundaries of the content are delimited by start-tags and end-tags. An empty element (one which contains no content) MAY instead be identified by an empty-element tag, which is equivalent to the corresponding start-tag immediately followed by the corresponding end-tag. Each element MUST have a name and MAY have one or more attributes. Each attribute has an attribute name and an attribute value.

Here are examples of start-tags:

<foo>
<bar baz="foo">

Here are examples of corresponding end-tags:

</foo>
</bar>

Here is an example of an empty-element tag:

<Image align="left" src="http://www.example.org/Icons/madonna" />

A start-tag begins with <, followed by the name of the element, followed by OPTIONAL attributes, followed by >. An end-tag begins with </ followed by the name of the element, followed by >. An empty-element tag is like a start-tag, but ends with /> instead of >. Whitespace MUST be used before each attribute and MAY be used before > or />.

The end of every element that begins with a start-tag MUST be marked by an end-tag containing a name that matches the element name given in the start-tag.

Element names are drawn from a restricted character repertoire; see section 2.8.

This specification does not constrain the semantics or use of element names.

For all elements other than the root element, if the start-tag is in the content of another element, the end-tag MUST be in the content of the same element; elements MUST nest properly.

2.3 Attributes

[9] attribute ::= attributeName s* '=' s* attributeValue
[10] attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                      | "'" ((attributeValueChar - "'") | charRef)* "'"
[11] attributeValueChar ::= char - ('<'|'>'|'&')
[12] attributeName ::= name - 'xmlns'

An attribute consists of an attribute name, followed by =, followed by a quoted attribute value. Either single or double quotes can be used around the value. Attribute values MUST NOT contain the <, >, or & characters except in the form of a character reference. Likewise, single-quoted attribute values MUST NOT contain single quotes except in the form of a character reference, and similarly for double-quoted attribute values.
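The quoting rules above can be sketched as a small serialization helper. This is an illustrative sketch only; the function name escape_attribute_value is hypothetical, and it uses the predefined character references from section 2.5 for the characters that may not appear literally.

```python
def escape_attribute_value(value: str, quote: str = '"') -> str:
    """Return value as a quoted attribute value (hypothetical helper).

    The characters <, > and & are always replaced by character
    references; the active quote character is replaced as well.
    """
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quote")
    table = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}
    # Only the quote character actually in use needs escaping.
    table[quote] = "&quot;" if quote == '"' else "&apos;"
    return quote + "".join(table.get(ch, ch) for ch in value) + quote
```

A double-quoted value can therefore contain a literal single quote, and a single-quoted value a literal double quote, without any escaping.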

Attribute names are drawn from a restricted character repertoire; see section 2.8. To avoid incompatibility with XML Namespaces, the attribute name xmlns MUST NOT be used.

The order of attributes in a start-tag or empty-element tag is not significant.

An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag.

This specification does not constrain the semantics or use of attribute names except for those beginning with xml.

2.4 Character data

[13] dataChar ::= char - ('<'|'&'|'>')

All text in a document that is not markup constitutes the character data of the document and of the most immediate element in which it exists. Note that whitespace outside the root element is markup, not character data. Any legal MicroXML character can be a data character except <, which signals the beginning of an element; &, which signals the beginning of a character reference, and >, which is forbidden for simplicity and compatibility with XML. If these characters are to appear in the data model of a document, they MUST appear as character references.
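As an illustration of the rule above, the following sketch escapes a string so that it can be written as character data. The function name escape_character_data is hypothetical, and the predefined character references of section 2.5 are assumed as the replacement forms.

```python
# The three characters that may not appear literally in character data.
_DATA_ESCAPES = {"<": "&lt;", "&": "&amp;", ">": "&gt;"}


def escape_character_data(text: str) -> str:
    """Replace <, & and > by character references (hypothetical helper)."""
    return "".join(_DATA_ESCAPES.get(ch, ch) for ch in text)
```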

2.5 Character references

[14] charRef ::= hexCharRef | namedCharRef
[16] hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
[17] namedCharRef ::= '&' charName ';'
[18] charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

A character reference in character content or attribute values stands for a specific Unicode character. Characters referred to using character references MUST match the production for char (see Section 2.10).

If the character reference begins with &#x, the digits and letters up to the terminating semicolon provide a hexadecimal representation of the character's code point in Unicode. If it begins just with &#, the digits up to the terminating semicolon provide a decimal representation of the character's code point.

For readability, a set of predefined character references is also provided for the purpose of escaping MicroXML's special characters: &amp; for &, &lt; for <, &gt; for >, &apos; for ', and &quot; for ". This has exactly the same effect as using character references: &#60; for <, &#38; for &, and so on.

Examples of character references: &lt; for LESS THAN, &#xA0; for NON-BREAKING SPACE, &#916; for GREEK CAPITAL LETTER DELTA, &#66352; or &#x10330; for GOTHIC LETTER AHSA.
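The examples above can be checked with a small decoder. This is a sketch under stated assumptions: the name decode_char_ref is hypothetical, the decimal form is accepted because the prose and examples of this section describe it (production [14] as written lists only the hexadecimal and named forms), and the check that the result matches the production for char is reduced to a simple range test.

```python
import re

# The five predefined character references.
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Hexadecimal, decimal, or named form, each terminated by a semicolon.
_REF = re.compile(r"&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z]+));")


def decode_char_ref(ref: str) -> str:
    """Return the character that a single character reference stands for."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    hex_digits, dec_digits, name = match.groups()
    if name is not None:
        if name not in NAMED:
            raise ValueError(f"unknown character name: {name!r}")
        return NAMED[name]
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if code_point > 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    return chr(code_point)
```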

2.6 Comments

[19] comment ::= '<!--' ((char - '-') | ('-' (char - '-')))* '-->'

Comments are provided in MicroXML for human consumption only, and are not part of the MicroXML data model. They MAY appear before or after the root element, or anywhere else in a document except inside other markup.

A comment begins with <!-- and ends with -->. For compatibility with XML, a comment MUST NOT contain -- anywhere except as part of the beginning or end.

An example of a comment (note that <head> and <body> are not start-tags):

<!-- declarations for <head> & <body> -->
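Production [19] translates almost directly into a regular expression. The sketch below is illustrative only: the name is_comment is hypothetical, and the pattern checks the placement of hyphens but does not verify that every character matches the production for char.

```python
import re

# '<!--', then any run of either a non-hyphen character or a hyphen
# followed by a non-hyphen character, then '-->'.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text: str) -> bool:
    """Return True if text as a whole has the shape of a comment."""
    return COMMENT.fullmatch(text) is not None
```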

2.7 Processing Instructions

[22] pi ::= '<?' target (s+ attribute)* s* '?>'
[23] target ::= name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

Processing instructions (PIs) allow documents to contain instructions for applications. PIs MAY appear before the root element, and MUST NOT appear within or after it. The order of PIs in the document is not significant.

A PI begins with a name called a target, which is used to identify the application to which the instruction is directed, and contains attributes which give the application information on how to process the PI. For compatibility with XML, the target name xml in any combination of upper and lower case characters MUST NOT be used.

An example of a processing instruction:

<?xml-stylesheet type="text/css" href="style.css"?>

2.8 Names

[24] name ::= nameStartChar nameChar*
[25] nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                     | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                     | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[26] nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Element and attribute names use only a subset of the legal MicroXML characters. The first character of a name MUST be a nameStartChar, and any other characters MUST be nameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive. See section 6 for suggestions on how to create names.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where MicroXML names are used outside MicroXML documents. Providing this group gives those contexts hard guarantees about what cannot be part of a MicroXML name. The character #x037E GREEK QUESTION MARK is excluded because when normalized it becomes a semicolon. Note that #x2D HYPHEN-MINUS, #x2E FULL STOP (period), #x5F LOW LINE (underscore), and #xB7 MIDDLE DOT are explicitly permitted.

Names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization by the W3C.
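Productions [24] to [26] can be transcribed into a regular expression character class. The following is a sketch for illustration; the names is_name and is_attribute_name are hypothetical, and the ranges are copied from the productions above.

```python
import re

# Ranges of nameStartChar, production [25].
_NAME_START = (
    "A-Za-z_"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# nameChar adds digits, hyphen, full stop, middle dot and two more
# ranges, production [26].
_NAME_REST = _NAME_START + "0-9\\-.\u00B7\u0300-\u036F\u203F-\u2040"

NAME = re.compile(f"[{_NAME_START}][{_NAME_REST}]*")


def is_name(text: str) -> bool:
    """Return True if text matches the production for name."""
    return NAME.fullmatch(text) is not None


def is_attribute_name(text: str) -> bool:
    """A name other than 'xmlns', as in production [12]."""
    return is_name(text) and text != "xmlns"
```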

2.9 Whitespace

[27] s ::= #x9 | #xA | #x20

Whitespace consists of tabs, newlines, and spaces, all of which are permitted in various places within markup to increase readability.

2.10 Characters

[28] char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
[29] forbiddenChar ::= [#x7F-#x9F] | surrogateCodePoint
                     | [#xFDD0-#xFDEF] | [#xFFFE-#xFFFF] | [#x1FFFE-#x1FFFF]
                     | [#x2FFFE-#x2FFFF] | [#x3FFFE-#x3FFFF] | [#x4FFFE-#x4FFFF]
                     | [#x5FFFE-#x5FFFF] | [#x6FFFE-#x6FFFF] | [#x7FFFE-#x7FFFF]
                     | [#x8FFFE-#x8FFFF] | [#x9FFFE-#x9FFFF] | [#xAFFFE-#xAFFFF]
                     | [#xBFFFE-#xBFFFF] | [#xCFFFE-#xCFFFF] | [#xDFFFE-#xDFFFF]
                     | [#xEFFFE-#xEFFFF] | [#xFFFFE-#xFFFFF] | [#x10FFFE-#x10FFFF]
[30] surrogateCodePoint ::= [#xD800-#xDFFF]

Documents contain text, a sequence of characters, which represent markup or character data. A character is an atomic unit of text as specified by [Unicode]. The legal MicroXML characters exclude the ISO control characters (except those used as whitespace) and the full set of Unicode non-characters, as well as the Unicode surrogate code points (which are not actually Unicode characters). Unassigned Unicode code points are explicitly permitted. Do not confuse code points with UTF-8 or UTF-16 code units, or with octets.
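Productions [28] to [30] operate on code points, and can be sketched as a predicate over integers. The function names below are hypothetical. The test for non-characters at the end of each plane relies on the fact that every range of that kind listed in production [29] ends in FFFE-FFFF.

```python
def is_forbidden_char(code_point: int) -> bool:
    """Production [29]: controls, surrogates, and non-characters."""
    if 0x7F <= code_point <= 0x9F:      # DEL and the C1 controls
        return True
    if 0xD800 <= code_point <= 0xDFFF:  # surrogate code points
        return True
    if 0xFDD0 <= code_point <= 0xFDEF:  # non-characters in the BMP
        return True
    # The last two code points of each of the 17 planes.
    return (code_point & 0xFFFF) >= 0xFFFE


def is_char(code_point: int) -> bool:
    """Production [28]: whitespace, or #x21-#x10FFFF minus forbiddenChar."""
    if code_point in (0x9, 0xA, 0x20):
        return True
    return 0x21 <= code_point <= 0x10FFFF and not is_forbidden_char(code_point)
```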

To simplify the tasks of applications, MicroXML processors MUST behave as if they normalized all line breaks in documents before parsing them by translating both the two-character sequence #xD #xA, and any #xD that is not followed by #xA, to a single #xA character. Document authors are encouraged to avoid "compatibility characters" as defined in [Unicode].
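The line-break rule can be sketched in one line. The function name normalize_line_breaks is hypothetical; the order of the two replacements matters, because the two-character sequence has to be handled before any remaining lone #xD.

```python
def normalize_line_breaks(text: str) -> str:
    """Translate #xD #xA, and any remaining lone #xD, to a single #xA."""
    return text.replace("\r\n", "\n").replace("\r", "\n")
```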

Documents MAY begin with the Byte Order Mark described by [Unicode], also known as #xFEFF ZERO WIDTH NO-BREAK SPACE. This is an encoding signature, not part of either the markup or the character data of the MicroXML document.

[Unicode] says that canonically equivalent sequences of characters ought to be treated as identical. However, documents that are canonically equivalent according to Unicode but which use distinct code point sequences are considered distinct by MicroXML processors. Therefore, all documents SHOULD be in Normalization Form C as described by [Unicode]. Otherwise the user might unknowingly create canonically equivalent but unequal sequences that appear identical to the user but which are treated as distinct by MicroXML processors.

MicroXML processors MAY verify that their input is normalized, and MAY report non-normalized character sequences.
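A processor or authoring tool that chooses to perform this verification could do so along the following lines. This is a sketch using Python's standard unicodedata module; the function name is_nfc is hypothetical, and the result depends on the version of Unicode data shipped with the Python interpreter.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Two canonically equivalent spellings of the same visible text:
# a precomposed letter, and a base letter plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
```

The two strings appear identical to a reader but compare unequal as code point sequences, which is the situation the recommendation above is meant to avoid.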

3 The Data Model

This section defines an abstract data set called the MicroXML data model. It exists to provide:

The contents of the data model for a document are designed to convey its structure and content as expressed by its markup and character data. However, there are some items of markup which have no effect on the contents of the data model: the DOCTYPE declaration, comments, and processing instructions. The use or non-use of character references for non-reserved characters also has no effect.

The MicroXML data model does not require or favor a specific interface or class of interfaces. This specification presents the data model as a tree for the sake of clarity and simplicity, but there is no requirement that the model be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information conforming to the MicroXML data model.

The terms data model and element object are similar in meaning to the generic terms tree and node as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models. Element objects do not map one-to-one with the nodes of the DOM or the tree and nodes of the XPath data model.

3.1 Element Objects

A document's data model contains at least one element object. An element object is an abstract description of a single element in a document. Each element object has three associated properties: the name, the attribute map, and the sequence of children. The name is a string, the attribute map maps name strings to value strings, and each child in the sequence is either a string representing character data or an element object.

There is one element object in the data model for each element appearing in the document being modeled. One element object corresponds to the root of the element tree, and all other element objects are accessible by recursively following the sequence of its children.
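One possible tree representation of element objects is sketched below. It is an illustration, not a required interface: the class name Element and the function from_etree are hypothetical, processing instruction objects are left out, and the tree is built with Python's standard xml.etree.ElementTree parser, which is a general XML parser rather than a conforming MicroXML processor.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """An element object: a name, an attribute map, and children."""
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # Each child is either character data or another element object.
    children: List[Union[str, "Element"]] = field(default_factory=list)


def from_etree(node: ET.Element) -> Element:
    """Convert an ElementTree node into an element object."""
    children: List[Union[str, Element]] = []
    if node.text:
        children.append(node.text)
    for child in node:
        children.append(from_etree(child))
        # ElementTree stores the text that follows a child as its tail.
        if child.tail:
            children.append(child.tail)
    return Element(node.tag, dict(node.attrib), children)


model = from_etree(
    ET.fromstring("<greeting><w>Hello</w> <w>world</w>!</greeting>")
)
```

For the greeting example of section 2.1, the root element object has four children: an element object named w, the string " ", a second element object named w, and the string "!".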

3.2 Processing Instruction Objects

A document's data model can also contain processing instruction objects. A processing instruction object is similar to an element object, but has only two properties, the name and the attribute map, because it has no children. There is one processing instruction object in the data model for each processing instruction appearing in the document being modeled. The set of processing instruction objects is unordered.

3.3 Synthetic data models

This specification describes the data model resulting from parsing a MicroXML document. Data models MAY be constructed by other means, for example by use of an API or by transforming an existing data model.

4 Conformance

4.1 UTF-8 Encoding

MicroXML documents MUST be plain text encoded in UTF-8 [Unicode].
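A processor reading octets might enforce this requirement as sketched below. The function name decode_document is hypothetical. The sketch also removes a leading Byte Order Mark, which section 2.10 describes as an encoding signature rather than part of the document.

```python
def decode_document(data: bytes) -> str:
    """Decode octets as UTF-8 and drop a leading Byte Order Mark."""
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8.
    text = data.decode("utf-8")
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```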

4.2 Syntax Checking

Conforming MicroXML processors MUST detect and report violations of this specification's grammar and other constraints in documents they process. If such violations exist, the documents are by definition not MicroXML documents.

When any such violation is encountered, the MicroXML processor MAY attempt to continue processing the document, or MAY abandon processing and report a non-continuable error to the application. This is different from the corresponding rule for XML.

4.3 MicroXML Processors and the MicroXML Data Model

Conforming MicroXML processors MUST provide a mechanism to make the complete data model available to applications. Processors SHOULD NOT make comments available to the application, to prevent them from being used in place of elements, attributes, or processing instructions.

5 Notation

The formal grammar of MicroXML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form symbol ::= expression.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

#xN
where N is a hexadecimal integer, the expression matches the character in Unicode whose code point has the value indicated.
[a-zA-Z], [#xN-#xN]
matches any character with a value in the range(s) indicated (inclusive).
[abc], [#xN#xN#xN]
matches any character with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
"string"
matches a literal string matching the one given inside the double quotes.
'string'
matches a literal string matching the one given inside the single quotes.

These symbols can be combined to match more complex patterns as follows, where A and B represent expressions:

(A)
expression is treated as a unit and can be combined as described in this list.
A?
matches A or nothing; OPTIONAL A.
A B
matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B
matches A or B.
A - B
matches any string that matches A but does not match B.
A+
matches one or more occurrences of A. This operation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A*
matches zero or more occurrences of A. This operation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).

6 Suggestions for MicroXML Names (Non-Normative)

The following suggestions define what is believed to be best practice in the construction of MicroXML names. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version is used is left to the discretion of the document author or schema designer.

The first two suggestions exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned code points, and whitespace characters.

7 References

While these references cite a particular edition of a specification, conforming implementations of MicroXML MAY support later editions either in addition or as replacements, thus allowing MicroXML users to benefit from corrections and extensions to the other specifications on which it depends.

Unicode
The Unicode Consortium. The Unicode Standard, Version 6.0.0. Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-1-936213-01-6.
RFC 2119
IETF (Internet Engineering Task Force). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Scott Bradner, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)