The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
- #xN
  where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
- [a-zA-Z], [#xN-#xN]
  matches any Char with a value in the range(s) indicated (inclusive).
- [abc], [#xN#xN#xN]
  matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.
- [^a-z], [^#xN-#xN]
  matches any Char with a value outside the range indicated.
- [^abc], [^#xN#xN#xN]
  matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.
- "string"
  matches a literal string matching that given inside the double quotes.
- 'string'
  matches a literal string matching that given inside the single quotes.
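The character-level expressions above map fairly directly onto regular-expression character classes. The following is a minimal sketch in Python, not part of the notation itself: the helper names `char_from_ref` and `class_to_regex` are hypothetical, and the sketch assumes that a "-" in the first or last position of a bracket expression is a literal character rather than a range operator.

```python
import re

# One token inside brackets: either a #xN reference or a single literal character.
_TOKEN = re.compile(r"#x([0-9A-Fa-f]+)|(.)", re.DOTALL)


def char_from_ref(ref: str) -> str:
    """Return the character named by a '#xN' expression (hypothetical helper)."""
    m = re.fullmatch(r"#x([0-9A-Fa-f]+)", ref)
    if m is None:
        raise ValueError(f"not a #xN expression: {ref!r}")
    # int() ignores leading zeros, so '#x41' and '#x0041' name the same character.
    return chr(int(m.group(1), 16))


def class_to_regex(expr: str) -> re.Pattern:
    """Translate a bracket expression such as '[^#x41-#x5A]' into a compiled regex."""
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError(f"not a bracket expression: {expr!r}")
    body = expr[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    tokens = list(_TOKEN.finditer(body))
    parts = []
    for i, m in enumerate(tokens):
        if m.group(1) is not None:
            # A #xN reference: substitute the character itself, escaped for safety.
            parts.append(re.escape(chr(int(m.group(1), 16))))
        elif m.group(2) == "-" and 0 < i < len(tokens) - 1:
            # Assumption: an interior '-' is the range operator.
            parts.append("-")
        else:
            parts.append(re.escape(m.group(2)))
    return re.compile("[" + ("^" if negated else "") + "".join(parts) + "]")


digits = class_to_regex("[#x30-#x39]")
not_abc = class_to_regex("[^abc]")
```

Here `digits.fullmatch("5")` succeeds, while `not_abc.fullmatch("a")` returns `None`, mirroring the inclusive-range and negated-enumeration rules described above.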
These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:
- (expression)
  expression is treated as a unit and may be combined as described in this list.
- A?
  matches A or nothing; optional A.
- A B
  matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).
- A | B
  matches A or B.
- A - B
  matches any string that matches A but does not match B.
- A+
  matches one or more occurrences of A. Concatenation has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).
- A*
  matches zero or more occurrences of A. Concatenation has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
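Most of these combinators have close analogues in Python's `re` syntax, with the exception of the difference operator A - B, which has no direct regex equivalent. The sketch below is an approximation under stated assumptions: A, B, C and D stand for the single-letter literals "a" to "d", the helper names `difference` and `same_language` are hypothetical, and "matches" is taken to mean that the whole string matches.

```python
import re


def difference(a_pattern: str, b_pattern: str):
    """Return a predicate for 'A - B': the whole string matches A but not B."""
    a, b = re.compile(a_pattern), re.compile(b_pattern)

    def matches(text: str) -> bool:
        return a.fullmatch(text) is not None and b.fullmatch(text) is None

    return matches


# A B | C D  and  (A B) | (C D): sequence binds tighter than alternation.
ungrouped = re.compile(r"ab|cd")
grouped = re.compile(r"(?:ab)|(?:cd)")
# A (B | C) D describes a different set of strings, shown for contrast.
regrouped = re.compile(r"a(?:b|c)d")


def same_language(p: re.Pattern, q: re.Pattern, samples) -> bool:
    """Check that two patterns agree on every sample string."""
    return all((p.fullmatch(s) is None) == (q.fullmatch(s) is None) for s in samples)


samples = ["ab", "cd", "abd", "acd", ""]

# One or more lowercase letters, except exactly the string "xml".
not_xml = difference(r"[a-z]+", r"xml")
```

On these samples, `ungrouped` and `grouped` accept the same strings, while `regrouped` accepts "abd" and rejects "ab". Because the difference is evaluated on the whole string, `not_xml("xml")` is false but `not_xml("xmlns")` is true.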
Other notations used in the productions are:
- /* ... */
  comment.
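To illustrate how the rule form and the comment notation fit together, here is a minimal sketch in Python that splits one production into its symbol and expression after dropping any comments. The function name `split_rule` is hypothetical, and the sketch assumes that neither `/*` nor `::=` occurs inside a quoted literal, which a real parser would have to handle.

```python
import re

# A /* ... */ comment; non-greedy so that two comments on one line stay separate.
_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def split_rule(line: str) -> tuple[str, str]:
    """Split 'symbol ::= expression' into (symbol, expression), ignoring comments."""
    without_comments = _COMMENT.sub(" ", line)
    symbol, separator, expression = without_comments.partition("::=")
    if not separator:
        raise ValueError(f"not a grammar rule: {line!r}")
    return symbol.strip(), expression.strip()


# The comment text here is illustrative only.
symbol, expression = split_rule("S ::= (#x20 | #x9 | #xD | #xA)+ /* white space */")
```

After this call, `symbol` is `"S"` and `expression` is `"(#x20 | #x9 | #xD | #xA)+"`.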