Copyright © 2012 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
MicroXML is a subset of XML intended for use in contexts where full XML is, or is perceived to be, too large and complex. MicroXML provides a set of rules for defining markup languages intended for use in encoding data objects, and specifies behavior for certain software modules that access them.
This document is a private skunkworks and has no official standing of any kind, not having been reviewed by any organization in any way.
This draft was edited by John Cowan from bits and pieces of W3C and non-W3C documents edited by himself, James Clark, Tim Bray, Dave Hollander, Andrew Layman, Eve Maler, Jonathan Marsh, Jean Paoli, C. Michael Sperberg-McQueen, and Richard Tobin. There should be no suggestion that anybody other than John Cowan approves of the content or even the existence of the present document.
The copyright statement above applies to much of the text assembled for this document, but should not be taken as an indication that the W3C approves of the contents or existence of this document.
MicroXML describes a class of data objects called MicroXML documents, or just documents, provides a data model for them, and partially describes the behavior of computer programs which process them. By construction, MicroXML documents are well-formed XML 5th Edition documents.
The creation of an XML subset can be justified even though the costs of XML complexity have already been paid, for at least the following reasons:
The goals of MicroXML are as follows:
MicroXML documents are made up of characters, some of which form character data, and some of which form markup. Markup primarily encodes a description of the document's logical structure.
A software module called a MicroXML processor is used to read MicroXML documents and provide access to their content and structure. It is assumed that a MicroXML processor is doing its work on behalf of another module called the application. This specification describes the behavior of a MicroXML processor in terms of how it MUST process MicroXML documents and what information it MUST, SHOULD, and MAY provide to the application.
This specification, together with [RFC 2119] for requirement keywords and [Unicode] for characters, provides all the information necessary to understand MicroXML and construct computer programs to process it.
This version of the MicroXML specification can be distributed freely, as long as all text and legal notices remain intact.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
A sequence of characters is a MicroXML document if, taken as a whole, it matches the production labeled "document" and meets the further constraints found in the text of this specification marked with the keywords MUST or REQUIRED.
 document ::= (comment | pi | s)* element (comment | s)*
Each document contains a single element called the root element, plus OPTIONAL comments and whitespace before and after it, plus OPTIONAL processing instructions before (but not after) it.
Here is a simple example of a document whose root element is named greeting:
 element ::= startTag content endTag | emptyElementTag
 content ::= (element | comment | pi | dataChar | charRef)*
 startTag ::= '<' name (s+ attribute)* s* '>'
 emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
 endTag ::= '</' name s* '>'
Elements are the basic building blocks of documents. An element MAY contain a span of text called its content. The boundaries of the content are delimited by start-tags and end-tags. An empty element (one which contains no content) MAY also be identified by an empty-element tag, which is equivalent to the corresponding start-tag immediately followed by the corresponding end-tag. Each element MUST have a name and MAY have one or more attributes. Each attribute has an attribute name and an attribute value.
Here are examples of start-tags:
<foo> <bar baz="foo">
Here are examples of corresponding end-tags:
</foo> </bar>
Here is an example of an empty-element tag:
<Image align="left" src="http://www.example.org/Icons/madonna" />
A start-tag begins with <, followed by the name of the element, followed by OPTIONAL attributes, followed by >. An end-tag begins with </, followed by the name of the element, followed by >. An empty-element tag is like a start-tag, but ends with /> instead of >.
Whitespace MUST be used before each attribute and MAY be used before the closing > or />.
The end of every element that begins with a start-tag MUST be marked by an end-tag containing a name that matches the element name given in the start-tag.
Element names are drawn from a restricted character repertoire; see section 2.8.
This specification does not constrain the semantics or use of element names.
For all elements other than the root element, if the start-tag is in the content of another element, the end-tag MUST be in the content of the same element; elements MUST nest properly.
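The nesting rule can be illustrated with a stack: each start-tag pushes its name, and each end-tag must match the most recently opened element. The following is a minimal Python sketch, not a conforming processor. The function name `is_properly_nested` is hypothetical, tag names are approximated by a loose pattern rather than the full name production, and attribute syntax is not validated.

```python
import re

# Comments may contain text that looks like tags, so they are removed first.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Loose approximation of start-tags, end-tags, and empty-element tags.
# Group 1: "/" for an end-tag; group 2: the name; group 4: "/" for an empty-element tag.
_TAG = re.compile(r"<(/?)([^\s<>/=\"'?!]+)([^<>]*?)(/?)>")


def is_properly_nested(text):
    """Return True if every start-tag is closed by a matching end-tag in order."""
    open_names = []
    for match in _TAG.finditer(_COMMENT.sub("", text)):
        closing, name, _attributes, empty = match.groups()
        if closing:
            # An end-tag must name the most recently opened element.
            if not open_names or open_names.pop() != name:
                return False
        elif not empty:
            open_names.append(name)
    # Any name left on the stack belongs to an element that was never closed.
    return not open_names
```

For instance, `is_properly_nested('<a><b></a></b>')` is false because the end-tag for `a` arrives while `b` is still open.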
 attribute ::= attributeName s* '=' s* attributeValue
 attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"' | "'" ((attributeValueChar - "'") | charRef)* "'"
 attributeValueChar ::= char - ('<'|'>'|'&')
 attributeName ::= name - 'xmlns'
An attribute consists of an attribute name, followed by =, followed by a quoted attribute value. Either single or double quotes can be used around the value. Attribute values MUST NOT contain the <, >, or & characters except in the form of a character reference. Likewise, single-quoted attribute values MUST NOT contain single quotes except in the form of a character reference, and similarly for double-quoted attribute values.
Attribute names are drawn from a restricted character repertoire; see section 2.8. To avoid incompatibility with XML Namespaces, the attribute name xmlns MUST NOT be used.
The order of attributes in a start-tag or empty-element tag is not significant.
An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag.
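As an illustration of the attribute rules above, the sketch below reads the attribute portion of a start-tag into a mapping and rejects duplicate names and the name xmlns. It is a simplified Python sketch under stated assumptions: `parse_attributes` is a hypothetical helper, names are matched loosely rather than against the name production, and character references inside values are left undecoded and unchecked.

```python
import re

# One attribute: mandatory leading whitespace (tab, newline, or space),
# a loosely matched name, "=", and a single- or double-quoted value
# that contains neither "<" nor ">".
_ATTRIBUTE = re.compile(
    r"""[ \t\n]+([^ \t\n=<>"'/]+)[ \t\n]*=[ \t\n]*(?:"([^"<>]*)"|'([^'<>]*)')"""
)


def parse_attributes(tag_body):
    """Parse the text that follows the element name inside a start-tag."""
    attributes = {}
    position = 0
    while True:
        match = _ATTRIBUTE.match(tag_body, position)
        if match is None:
            break
        name = match.group(1)
        value = match.group(2) if match.group(2) is not None else match.group(3)
        if name == "xmlns":
            raise ValueError("the attribute name xmlns must not be used")
        if name in attributes:
            raise ValueError("duplicate attribute name: " + name)
        attributes[name] = value
        position = match.end()
    # Only trailing whitespace may remain after the last attribute.
    if tag_body[position:].strip(" \t\n"):
        raise ValueError("unexpected text in tag: " + tag_body[position:])
    return attributes
```

Returning a plain dictionary reflects the rule that attribute order is not significant.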
This specification does not constrain the semantics or use of attribute names except for those beginning with xml (in any combination of upper and lower case).
 dataChar ::= char - ('<'|'&'|'>')
All text in a document that is not markup constitutes the character data of the document and of the most immediate element in which it exists. Note that whitespace outside the root element is markup, not character data. Any legal MicroXML character can be a data character except <, which signals the beginning of an element; &, which signals the beginning of a character reference; and >, which is forbidden for simplicity and compatibility with XML. If these characters are to appear in the data model of a document, they MUST appear as character references.
 charRef ::= hexCharRef | namedCharRef
 hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
 namedCharRef ::= '&' charName ';'
 charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'
A character reference in character content or attribute values stands for a specific Unicode character. Characters referred to using character references MUST match the production for char (see section 2.10).
If the character reference begins with &#x, the digits and letters up to the terminating semicolon provide a hexadecimal representation of the character's code point in Unicode. If it begins just with &#, the digits up to the terminating semicolon provide a decimal representation of the character's code point.
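For readability, a set of predefined character references is also provided for the purpose of escaping MicroXML's special characters: &amp; for &, &lt; for <, &gt; for >, &quot; for ", and &apos; for '. This has exactly the same effect as using the corresponding hexadecimal character references: &#x26; for &, and so on.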
Examples of character references:
&#x394; GREEK CAPITAL LETTER DELTA,
&#x10330; GOTHIC LETTER AHSA.
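Decoding character references can be sketched in Python as follows. The sketch follows the charRef production as written, which lists only the hexadecimal form and the five predefined names, so it does not handle any other form. The function name `decode_char_refs` is hypothetical, and the check that the resulting character matches the char production is omitted here.

```python
import re

# The five predefined names from the charName production.
_PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
# Either "&#x" hex digits ";" or "&" name ";".
_CHAR_REF = re.compile(r"&(?:#x([0-9a-fA-F]+)|([A-Za-z]+));")


def decode_char_refs(text):
    """Replace each character reference in text with the character it stands for."""

    def replace(match):
        hex_digits, name = match.groups()
        if hex_digits is not None:
            code_point = int(hex_digits, 16)
            if code_point > 0x10FFFF:
                raise ValueError("code point out of range: " + hex_digits)
            return chr(code_point)
        if name not in _PREDEFINED:
            raise ValueError("unknown character name: " + name)
        return _PREDEFINED[name]

    # re.sub makes a single pass, so "&amp;lt;" decodes to "&lt;" and is not decoded again.
    return _CHAR_REF.sub(replace, text)
```

Decoding in a single pass matters: the result of one reference must never be reinterpreted as the start of another.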
 comment ::= '<!--' ((char - '-') | ('-' (char - '-')))* '-->'
Comments are provided in MicroXML for human consumption only, and are not part of the MicroXML data model. They MAY appear before or after the root element, or anywhere else in a document except inside other markup.
A comment begins with <!-- and ends with -->. For compatibility with XML, a comment MUST NOT contain -- except as part of the beginning or end.
An example of a comment (note that <head> and <body> are not start-tags):
<!-- declarations for <head> & <body> -->
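The comment production translates almost directly into a regular expression. The Python sketch below is an illustration only: `is_comment` is a hypothetical name, and the pattern does not check that every character matches the char production.

```python
import re

# '<!--' ((char - '-') | ('-' (char - '-')))* '-->'
# A hyphen inside the comment body must be followed by a non-hyphen.
_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text):
    """Return True if text, taken as a whole, has the shape of a comment."""
    return _COMMENT.fullmatch(text) is not None
```

Because a hyphen in the body must be followed by a non-hyphen, the body can neither contain two adjacent hyphens nor end with a hyphen.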
 pi ::= '<?' target (s+ attribute)* s* '?>'
 target ::= name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))
Processing instructions (PIs) allow documents to contain instructions for applications. PIs MAY appear before the root element, and MUST NOT appear within or after it. The order of PIs in the document is not significant.
A PI begins with a name called a target, which is used to identify the application to which the instruction is directed, and contains attributes which give the application information on how to process the PI. For compatibility with XML, the target name xml in any combination of upper and lower case characters MUST NOT be used.
An example of a processing instruction:
<?xml-stylesheet type="text/css" href="style.css"?>
 name ::= nameStartChar nameChar*
 nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
 nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Element and attribute names use only a subset of the legal MicroXML characters. The first character of a name MUST be a nameStartChar, and any other characters MUST be nameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive. See section 8 for suggestions on how to create names.
The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where MicroXML names are used outside MicroXML documents. Providing this group gives those contexts hard guarantees about what cannot be part of a MicroXML name. The character #x037E GREEK QUESTION MARK is excluded because when normalized it becomes a semicolon. Note that #x2E FULL STOP (period), #x5F LOW LINE (underscore), and #xB7 MIDDLE DOT are explicitly permitted.
Names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization by the W3C.
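The name production can be checked with two tables of code point ranges copied from the grammar above. This Python sketch uses hypothetical function names and does not enforce the reservation of names beginning with xml.

```python
# Ranges from the nameStartChar production, as inclusive (low, high) pairs.
_NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional ranges that nameChar allows after the first character.
_NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ord(ch), _NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ord(ch), _NAME_EXTRA_RANGES)


def is_name(text):
    """Return True if text matches: name ::= nameStartChar nameChar*"""
    return (
        len(text) > 0
        and is_name_start_char(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

Note that `is_name("a:b")` is false, since the colon does not appear in either production.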
 s ::= #x9 | #xA | #x20
Whitespace consists of tabs, newlines, and spaces, all of which are permitted in various places within markup to increase readability.
 char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
 forbiddenChar ::= [#x7F-#x9F] | surrogateCodePoint | [#xFDD0-#xFDEF] | [#xFFFE-#xFFFF] | [#x1FFFE-#x1FFFF] | [#x2FFFE-#x2FFFF] | [#x3FFFE-#x3FFFF] | [#x4FFFE-#x4FFFF] | [#x5FFFE-#x5FFFF] | [#x6FFFE-#x6FFFF] | [#x7FFFE-#x7FFFF] | [#x8FFFE-#x8FFFF] | [#x9FFFE-#x9FFFF] | [#xAFFFE-#xAFFFF] | [#xBFFFE-#xBFFFF] | [#xCFFFE-#xCFFFF] | [#xDFFFE-#xDFFFF] | [#xEFFFE-#xEFFFF] | [#xFFFFE-#xFFFFF] | [#x10FFFE-#x10FFFF]
 surrogateCodePoint ::= [#xD800-#xDFFF]
Documents contain text, a sequence of characters, which represent markup or character data. A character is an atomic unit of text as specified by [Unicode]. The legal MicroXML characters exclude the ISO control characters (except those used as whitespace) and the full set of Unicode non-characters, as well as the Unicode surrogate code points (which are not actually Unicode characters). Unassigned Unicode code points are explicitly permitted. Do not confuse code points with UTF-8 or UTF-16 code units, or with octets.
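The char production can be expressed as a short predicate over code points. In this Python sketch, `is_char` is a hypothetical name. The per-plane ranges ending in FFFE and FFFF are folded into a single bit test, which covers the same code points as the enumerated list in forbiddenChar.

```python
def is_char(ch):
    """Return True if the single character ch matches the char production."""
    code_point = ord(ch)
    if code_point in (0x9, 0xA, 0x20):  # the s production: tab, newline, space
        return True
    if code_point < 0x21 or code_point > 0x10FFFF:
        return False
    if 0x7F <= code_point <= 0x9F:  # DEL and the C1 control characters
        return False
    if 0xD800 <= code_point <= 0xDFFF:  # surrogate code points
        return False
    if 0xFDD0 <= code_point <= 0xFDEF:  # non-characters in the BMP
        return False
    if (code_point & 0xFFFE) == 0xFFFE:  # last two code points of every plane
        return False
    return True
```

Unassigned code points pass this test, consistent with the statement that they are explicitly permitted.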
To simplify the tasks of applications, MicroXML processors MUST behave as if they normalized all line breaks in documents before parsing them by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
Document authors are encouraged to avoid "compatibility characters" as defined in [Unicode].
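The line-break normalization described in this section amounts to two replacements applied in a fixed order. A minimal Python sketch, with the hypothetical name `normalize_line_breaks`:

```python
def normalize_line_breaks(text):
    """Translate #xD #xA pairs and lone #xD characters to a single #xA."""
    # Replace the two-character sequence first, so that the #xD of a pair
    # is not turned into an extra line break by the second replacement.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

Since #xD is not matched by the char production, running this step before parsing is what allows documents with such line endings to be accepted.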
Documents MAY begin with the Byte Order Mark described by [Unicode], also known as #xFEFF ZERO WIDTH NO-BREAK SPACE. This is an encoding signature, not part of either the markup or the character data of the MicroXML document.
[Unicode] says that canonically equivalent sequences of characters ought to be treated as identical. However, documents that are canonically equivalent according to Unicode but which use distinct code point sequences are considered distinct by MicroXML processors. Therefore, all documents SHOULD be in Normalization Form C as described by [Unicode]. Otherwise the user might unknowingly create canonically equivalent but unequal sequences that appear identical to the user but which are treated as distinct by MicroXML processors.
MicroXML processors MAY verify that their input is normalized, and MAY report non-normalized character sequences.
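A processor that chooses to verify normalization could do so along the following lines. This Python sketch relies on the standard library's `unicodedata.normalize`; the helper names `to_nfc` and `is_nfc` are hypothetical.

```python
import unicodedata


def to_nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return to_nfc(text) == text
```

For example, "é" written as the single code point #xE9 is in Normalization Form C, while the canonically equivalent sequence "e" followed by #x301 COMBINING ACUTE ACCENT is not, and the two compare unequal as code point sequences.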
This section defines an abstract data set called the MicroXML data model. It exists to provide:
The contents of the data model for a document are designed to convey its structure and content as expressed by its markup and character data. However, there are some items of markup which have no effect on the contents of the data model: the DOCTYPE declaration, comments, and processing instructions. The use or non-use of character references for non-reserved characters also has no effect.
The MicroXML data model does not require or favor a specific interface or class of interfaces. This specification presents the data model as a tree for the sake of clarity and simplicity, but there is no requirement that the model be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information conforming to the MicroXML data model.
The terms data model and element object are similar in meaning to the generic terms tree and node as they are used in computing. However, the former terms are used in this specification to reduce possible confusion with other specific data models. Element objects do not map one-to-one with the nodes of the DOM or the tree and nodes of the XPath data model.
A document's data model contains at least one element object. An element object is an abstract description of a single element in a document. Each element object has three associated properties: the name, the attribute map, and the sequence of children. The name is a string, the attribute map maps name strings to value strings, and each child in the sequence is either a string representing character data or an element object.
There is one element object in the data model for each element appearing in the document being modeled. One element object corresponds to the root of the element tree, and all other element objects are accessible by recursively following the sequence of its children.
A document's data model can also contain processing instruction objects. A processing instruction object is similar to an element object, but has only two properties, the name and the attribute map, because it has no children. There is one processing instruction object in the data model for each processing instruction appearing in the document being modeled. The set of processing instruction objects is unordered.
This specification describes the data model resulting from parsing a MicroXML document. Data models MAY be constructed by other means, for example by use of an API or by transforming an existing data model.
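One possible in-memory representation of an element object, given here only as an illustrative Python sketch, is a small record with the three properties described above. The class name `Element` and the helper `iter_elements` are assumptions of this example, and nothing in this specification requires a tree-shaped interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Union


@dataclass
class Element:
    """An element object: a name, an attribute map, and a sequence of children."""

    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    # Each child is either a string of character data or another element object.
    children: List[Union[str, "Element"]] = field(default_factory=list)


def iter_elements(root: Element) -> Iterator[Element]:
    """Yield root and every element object reachable through its children."""
    yield root
    for child in root.children:
        if isinstance(child, Element):
            yield from iter_elements(child)


# A data model built through an API rather than by parsing.
greeting = Element(
    "greeting",
    {"lang": "en"},
    ["Hello, ", Element("em", children=["world"]), "!"],
)
```

Comments have no counterpart in this structure, in keeping with their exclusion from the data model.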
MicroXML documents MUST be plain text encoded in UTF-8 [Unicode].
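Reading a document from bytes can be sketched as strict UTF-8 decoding preceded by removal of an initial Byte Order Mark, which is an encoding signature rather than document content. In this Python sketch, `decode_document` is a hypothetical name; `codecs.BOM_UTF8` is the standard library constant for the three-byte UTF-8 signature.

```python
import codecs


def decode_document(data):
    """Decode the bytes of a document as UTF-8, dropping an initial Byte Order Mark."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Strict decoding raises UnicodeDecodeError for input that is not valid UTF-8.
    return data.decode("utf-8")
```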
Conforming MicroXML processors MUST detect and report violations of this specification's grammar and other constraints in documents they process. If such violations exist, the documents are by definition not MicroXML documents.
When any such violation is encountered, the MicroXML processor MAY attempt to continue processing the document, or MAY abandon processing and report a non-continuable error to the application. This is different from the corresponding rule for XML.
Conforming MicroXML processors MUST provide a mechanism to make the complete data model available to applications. Processors SHOULD NOT make comments available to the application, to prevent them from being used in place of elements, attributes, or processing instructions.
The formal grammar of MicroXML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form symbol ::= expression.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
#xN, where N is a hexadecimal integer: the expression matches the character in Unicode whose code point has the value indicated.
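These symbols can be combined to match more complex patterns as follows, where A and B represent expressions:
A? matches A or nothing; OPTIONAL A.
A B matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
A | B matches A or B.
A - B matches any string that matches A but does not match B.
A+ matches one or more occurrences of A. This operation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
A* matches zero or more occurrences of A. This operation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).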
The following suggestions define what is believed to be best practice in the construction of MicroXML names. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version is used is left to the discretion of the document author or schema designer.
The first two suggestions exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned code points, and whitespace characters.
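The first character of any name SHOULD have a Unicode property of ID_Start, or else be #x5F LOW LINE (underscore).
Characters other than the first SHOULD have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of #x27 APOSTROPHE and #x2019 RIGHT SINGLE QUOTATION MARK.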
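Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) SHOULD NOT be used in names.
Characters which have a compatibility decomposition SHOULD NOT be used in names. (This suggestion does not apply to #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM, which despite their compatibility decompositions are in regular use in those scripts.)
Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) SHOULD NOT be used in names.
The interlinear annotation characters ([#xFFF9-#xFFFB]) SHOULD NOT be used in names.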
While these references cite a particular edition of a specification, conforming implementations of MicroXML MAY support later editions either in addition or as replacements, thus allowing MicroXML users to benefit from corrections and extensions to the other specifications on which it depends.