@jessestricker
Created October 12, 2024 19:38
Extended Backus-Naur Form (EBNF) as used by the W3C XML specification.

Notation

The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form

symbol ::= expression

Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lowercase letter. Literal strings are quoted.

Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

  • #xN

    where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.

  • [a-zA-Z], [#xN-#xN]

    matches any Char with a value in the range(s) indicated (inclusive).

  • [abc], [#xN#xN#xN]

    matches any Char with a value among the characters enumerated. Enumerations and ranges can be mixed in one set of brackets.

  • [^a-z], [^#xN-#xN]

    matches any Char with a value outside the range indicated.

  • [^abc], [^#xN#xN#xN]

    matches any Char with a value not among the characters given. Enumerations and ranges of forbidden values can be mixed in one set of brackets.

  • "string"

    matches a literal string matching that given inside the double quotes.

  • 'string'

    matches a literal string matching that given inside the single quotes.
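
The character-level expressions above map closely onto regular-expression character classes. The following is a minimal Python sketch of that mapping, not part of the specification; the helper names `hex_char` and `char_class` are hypothetical and introduced only for illustration.

```python
import re

def hex_char(expr):
    """Return the character denoted by a #xN expression, e.g. '#x41' -> 'A'."""
    if not expr.startswith("#x"):
        raise ValueError(f"not a #xN expression: {expr!r}")
    # int() ignores leading zeros, so #x41 and #x0041 give the same character.
    return chr(int(expr[2:], 16))

def char_class(*items, negate=False):
    """Compile a one-character pattern from #xN items and (lo, hi) ranges.

    Enumerated characters and ranges may be mixed, as in the bracket forms.
    """
    parts = []
    for item in items:
        if isinstance(item, tuple):   # a range such as [#xN-#xN]
            lo, hi = item
            parts.append(re.escape(hex_char(lo)) + "-" + re.escape(hex_char(hi)))
        else:                         # a single enumerated character
            parts.append(re.escape(hex_char(item)))
    return re.compile("[" + ("^" if negate else "") + "".join(parts) + "]")

# [#x20#x9#xD#xA]: an enumeration of four characters
ws = char_class("#x20", "#x9", "#xD", "#xA")
# [#x41-#x5A] and [^#x41-#x5A]: a range and its negation
upper = char_class(("#x41", "#x5A"))
not_upper = char_class(("#x41", "#x5A"), negate=True)
# [#x5F#x61-#x7A]: an enumeration mixed with a range
ident_start = char_class("#x5F", ("#x61", "#x7A"))
```

Each compiled pattern matches exactly one character, so `fullmatch` is the natural way to test a single character against it.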

These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:

  • (expression)

    expression is treated as a unit and may be combined as described in this list.

  • A?

    matches A or nothing; optional A.

  • A B

    matches A followed by B. This operator has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D).

  • A | B

    matches A or B.

  • A - B

    matches any string that matches A but does not match B.

  • A+

    matches one or more occurrences of A. The + operator has higher precedence than alternation; thus A+ | B+ is identical to (A+) | (B+).

  • A*

    matches zero or more occurrences of A. The * operator has higher precedence than alternation; thus A* | B* is identical to (A*) | (B*).
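
The combining operators in this list can be sketched as small matcher combinators. This is one possible Python model, not a definitive implementation: every name here (`lit`, `seq`, `alt`, `opt`, `diff`, `star`, `plus`, `matches`) is hypothetical. Each matcher takes a text and a start position and returns the set of end positions at which a match can stop, which makes the difference operator A - B a plain set difference.

```python
def lit(s):
    """Literal string: "string" or 'string'."""
    return lambda text, pos: {pos + len(s)} if text.startswith(s, pos) else set()

def seq(a, b):
    """A B: A followed by B."""
    return lambda text, pos: {e2 for e1 in a(text, pos) for e2 in b(text, e1)}

def alt(a, b):
    """A | B: A or B."""
    return lambda text, pos: a(text, pos) | b(text, pos)

def opt(a):
    """A?: A or nothing."""
    return lambda text, pos: {pos} | a(text, pos)

def diff(a, b):
    """A - B: strings that match A but do not match B."""
    return lambda text, pos: a(text, pos) - b(text, pos)

def star(a):
    """A*: zero or more occurrences of A."""
    def match(text, pos):
        ends = {pos}
        frontier = {pos}
        while frontier:
            # Extend every newly reached position by one more A; stop when
            # no new end positions appear (also guards against empty matches).
            frontier = {e for p in frontier for e in a(text, p)} - ends
            ends |= frontier
        return ends
    return match

def plus(a):
    """A+: one or more occurrences of A."""
    return seq(a, star(a))

def matches(m, text):
    """True if the matcher m accepts the whole of text."""
    return len(text) in m(text, 0)

# A B | C D is (A B) | (C D): sequencing is grouped before alternation.
ab_or_cd = alt(seq(lit("a"), lit("b")), seq(lit("c"), lit("d")))

# ("a" | "b")+ - "ab": any non-empty run of a/b except the exact string "ab".
run_not_ab = diff(plus(alt(lit("a"), lit("b"))), lit("ab"))
```

For instance, `matches(ab_or_cd, "ab")` and `matches(ab_or_cd, "cd")` both hold while `matches(ab_or_cd, "ad")` does not, and `matches(run_not_ab, "ab")` is false while `matches(run_not_ab, "ba")` is true. Returning sets of end positions rather than a single greedy result keeps the difference and repetition operators faithful to the "matches any string that..." wording, at the cost of efficiency.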

Other notations used in the productions are:

  • /* ... */

    comment.
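
Because comments carry no grammatical meaning, a tool that reads productions can discard them first. The sketch below shows one way to do that in Python; the function name `strip_comments` is hypothetical, and the code assumes that comments do not nest and that a `/*` appearing inside a quoted literal or a bracketed character set is not a comment.

```python
import re

# Match bracketed sets and quoted literals first so that their contents are
# never mistaken for a comment, then the comment form itself.
_TOKEN = re.compile(r"""\[[^\]]*\]|"[^"]*"|'[^']*'|/\*.*?\*/""", re.DOTALL)

def strip_comments(grammar):
    """Remove /* ... */ comments, leaving literals and character sets intact."""
    def keep_or_drop(m):
        tok = m.group(0)
        return "" if tok.startswith("/*") else tok
    return _TOKEN.sub(keep_or_drop, grammar)

rule = "S ::= (#x20 | #x9 | #xD | #xA)+ /* white space */"
cleaned = strip_comments(rule)
```

Here `cleaned` holds the production text up to and including the space that preceded the comment.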
