@ben221199
Last active July 8, 2024 16:57

SGML

Useful details about parsing SGML.

Declaration subsets

Recognition modes

Two recognition modes apply to declaration subsets, DS and DSM:

  • DS - Only recognizable in a declaration subset
  • DSM - Recognizable in a declaration subset or a marked section

So, when the parser sees a DSO ([), it enters one of the following objects:

  • Declaration subset: The recognition modes are both DS and DSM.
  • Marked section: The recognition mode is only DSM.
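
As a small illustration, the two lists above can be written down as a lookup table. This is only a sketch of the notes in this section; the names ACTIVE_MODES and is_recognized are made up for the example and do not come from any SGML library.

```python
# Which recognition modes are active inside each object opened by a DSO,
# as listed above. The dictionary and function names are illustrative.
ACTIVE_MODES = {
    "declaration subset": {"DS", "DSM"},
    "marked section": {"DSM"},
}


def is_recognized(delimiter_mode: str, context: str) -> bool:
    """Return True if a delimiter with this recognition mode is seen in context."""
    return delimiter_mode in ACTIVE_MODES[context]


print(is_recognized("DS", "declaration subset"))  # True
print(is_recognized("DS", "marked section"))      # False
print(is_recognized("DSM", "marked section"))     # True
```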

Marked sections

Marked sections are objects that start with an MDO (<!) followed by a DSO ([) and end with an MSC (]]) followed by an MDC (>). Everything between them is treated as marked section content.

Recognition modes

Marked sections can appear in the same places as a markup declaration, because they share the same starting delimiter (MDO). The recognition modes are:

  • CON - In content, like normal tags. For example:
    • <a href="#cdata">This is a <![CDATA[link]]> with a marked section.</a>
  • DSM - In a declaration subset or another marked section. For example:
    • Inside a declaration subset: <!DOCTYPE html [ <![CDATA[some data]]> ]>
    • Inside another marked section: <![INCLUDE[ <![CDATA[some data]]> ]]>

Entity

A marked section can also be defined using an entity:

<!ENTITY myMarkedSectionEntity MS "CDATA[my data content">

The MS keyword wraps the entity text in the marked section start and end delimiters, so a reference to &myMarkedSectionEntity; is replaced with:

<![CDATA[my data content]]>
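
The expansion can be sketched in a few lines of Python. This is a minimal illustration that assumes the reference concrete syntax delimiters used throughout these notes; expand_ms_entity is a hypothetical helper, not part of any existing parser.

```python
# Delimiters as used in these notes (reference concrete syntax assumed).
MDO = "<!"  # markup declaration open
DSO = "["   # declaration subset open
MSC = "]]"  # marked section close
MDC = ">"   # markup declaration close


def expand_ms_entity(entity_text: str) -> str:
    """Bracket the text of an MS entity with marked section delimiters."""
    return MDO + DSO + entity_text + MSC + MDC


print(expand_ms_entity("CDATA[my data content"))
# prints: <![CDATA[my data content]]>
```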

Ending

The marked section ends with an MSC (]]), directly followed by an MDC (>). This MSC is recognized in both CON mode and DSM mode.

Why is this? Let's assume we start in CON mode:

  • Mode stack: CON (Content)
  • MDO: <!
  • Mode stack: CON/MD (Markup declaration)
  • DSO: [
  • Mode stack: CON/MD/DSM (Declaration subset OR Marked section)
  • CDATA
  • Mode stack: CON/MD/DSM (Declaration subset OR Marked section)
  • DSO: [ (Likely, this delimiter sets the mode to content.)
  • Mode stack: CON/MD/DSM/CON (Content)
  • some data
  • Mode stack: CON/MD/DSM/CON (Content)
  • MSC: ]]
  • Mode stack: CON/MD (Markup declaration)
  • MDC: >
  • Mode stack: CON (Content)

It is likely that the second DSO ([) of the marked section switches the mode from DSM to CON. Once a marked section is started, it has to be closed by an MSC (]]), even when the second DSO ([) hasn't appeared yet. Because the first DSO ([) brings the mode to DSM, one would expect MSC (]]) to be recognized only in DSM. Since it is also recognized in CON, the second DSO ([) most likely brings the mode to CON.
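
The walkthrough above can be replayed with a small tracer. This is a sketch of the assumed mode transitions only, not a conforming SGML parser: it handles a single, non-nested marked section that starts in content, and trace_marked_section is a name invented for this example.

```python
def trace_marked_section(text: str) -> list:
    """Return (token, mode stack) pairs for one marked section in content.

    Follows the transitions assumed in the walkthrough; the jump from DSM
    to CON at the second DSO is the hypothesis, not a confirmed rule.
    """
    stack = ["CON"]
    steps = []
    pos = 0

    def emit(token: str) -> None:
        steps.append((token, "/".join(stack)))

    if not text.startswith("<!", pos):
        raise ValueError("expected MDO")
    pos += 2
    stack.append("MD")
    emit("MDO")

    if not text.startswith("[", pos):
        raise ValueError("expected first DSO")
    pos += 1
    stack.append("DSM")
    emit("DSO")

    second_dso = text.find("[", pos)
    if second_dso == -1:
        raise ValueError("expected second DSO")
    emit(text[pos:second_dso].strip())  # the keyword, still in DSM
    pos = second_dso + 1
    stack.append("CON")                 # assumed: second DSO enters content
    emit("DSO")

    msc = text.find("]]", pos)
    if msc == -1:
        raise ValueError("expected MSC")
    emit(text[pos:msc])                 # the section content
    pos = msc + 2
    del stack[-2:]                      # MSC leaves both CON and DSM
    emit("MSC")

    if not text.startswith(">", pos):
        raise ValueError("expected MDC")
    stack.pop()                         # MDC leaves the markup declaration
    emit("MDC")
    return steps


for token, modes in trace_marked_section("<![CDATA[some data]]>"):
    print(f"{token!r:14} {modes}")
```

Running it on <![CDATA[some data]]> reproduces the mode stacks listed above, ending back in CON.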

Keywords

Five keywords are described for marked sections:

  • IGNORE
  • INCLUDE
  • CDATA
  • RCDATA
  • TEMP

CDATA and RCDATA sections end at the first MSC (]]) that appears. CDATA recognizes nothing else; RCDATA additionally recognizes references. IGNORE recognizes only nested marked section starts and ends, so that it can find the matching MSC (]]). INCLUDE recognizes all markup that is allowed at that point.
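
The difference between the keywords when searching for the end of a section can be sketched as follows. This is an illustration under simplifying assumptions: find_section_end is a hypothetical helper, it takes the offset just after the second DSO, and it deliberately does not handle INCLUDE or TEMP, which need a full parse of the content.

```python
def find_section_end(text: str, start: int, keyword: str) -> int:
    """Return the index of the "]]>" that closes a marked section.

    `start` is the offset just after the second DSO of the section.
    """
    if keyword in ("CDATA", "RCDATA"):
        # The first MSC + MDC ends the section; nesting is not recognized.
        end = text.find("]]>", start)
        if end == -1:
            raise ValueError("unterminated marked section")
        return end
    if keyword == "IGNORE":
        # Only nested section starts and ends are recognized, so count depth.
        depth = 1
        pos = start
        while pos < len(text):
            if text.startswith("<![", pos):
                depth += 1
                pos += 3
            elif text.startswith("]]>", pos):
                depth -= 1
                if depth == 0:
                    return pos
                pos += 3
            else:
                pos += 1
        raise ValueError("unterminated marked section")
    # INCLUDE and TEMP content is parsed as normal markup; out of scope here.
    raise NotImplementedError(keyword)


sample = "<![IGNORE[ a <![CDATA[ b ]]> c ]]> tail"
start = len("<![IGNORE[")
print(repr(sample[start:find_section_end(sample, start, "IGNORE")]))
print(repr(sample[start:find_section_end(sample, start, "CDATA")]))
```

With IGNORE the search skips the inner ]]> and stops at the outer one; with CDATA the same text would end at the first ]]>.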
