Useful details about parsing SGML.
There are two recognition modes, DS and DSM:
- DS - Only recognizable in a declaration subset
- DSM - Recognizable in a declaration subset or a marked section
So, when the parser sees a DSO ([), it enters one of the following objects:
- Declaration subset: The recognition modes are both DS and DSM.
- Marked section: The recognition mode is only DSM.
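The relationship between the two objects and the two recognition modes can be written down as a small lookup table. This is only a sketch of the rule stated above; the names RECOGNITION_MODES and is_recognized are hypothetical and do not come from any SGML parser.

```python
# Which recognition modes are active inside each object opened by a DSO.
# Names are illustrative assumptions, not part of any real parser API.
RECOGNITION_MODES = {
    "declaration subset": {"DS", "DSM"},
    "marked section": {"DSM"},
}


def is_recognized(delimiter_mode: str, current_object: str) -> bool:
    """Return True if a delimiter bound to delimiter_mode is recognized in current_object."""
    return delimiter_mode in RECOGNITION_MODES[current_object]


# A DS-only delimiter is recognized in a declaration subset but not in a marked section.
in_subset = is_recognized("DS", "declaration subset")
in_section = is_recognized("DS", "marked section")
```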
Marked sections are objects that are started by an MDO (<!) and a DSO ([) and are ended by an MSC (]]) and an MDC (>). Everything between them is seen as marked section content.
Marked sections can appear in the same places as a markup declaration, because they share the same starting delimiter. The recognition modes are:
- CON - Like normal tags, within content. For example:
  <a href="#cdata">This is a <![CDATA[link]]> with a marked section.</a>
- DSM - In a declaration subset or another marked section. For example:
  - Inside a declaration subset:
    <!DOCTYPE html [ <![CDATA[some data]]> ]>
  - Inside another marked section:
    <![INCLUDE[ <![CDATA[some data]]> ]]>
A marked section can also be defined using an entity:
<!ENTITY myMarkedSectionEntity MS "CDATA[my data content">
On a page with &myMarkedSectionEntity;, it will be converted to:
<![CDATA[my data content]]>
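The expansion above amounts to wrapping the entity text in the marked section delimiters. A minimal sketch, assuming entities are held in a plain dictionary and references are replaced by simple string substitution (a real parser resolves references while tokenizing); the function names are hypothetical.

```python
MDO, DSO, MSC, MDC = "<!", "[", "]]", ">"


def expand_ms_entity(entity_text: str) -> str:
    """Wrap the text of an MS entity in marked section start and end delimiters."""
    return f"{MDO}{DSO}{entity_text}{MSC}{MDC}"


def expand_references(page: str, ms_entities: dict) -> str:
    """Replace each &name; reference with the expanded marked section (simplified)."""
    for name, entity_text in ms_entities.items():
        page = page.replace(f"&{name};", expand_ms_entity(entity_text))
    return page


ms_entities = {"myMarkedSectionEntity": "CDATA[my data content"}
expanded = expand_references("before &myMarkedSectionEntity; after", ms_entities)
```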
The marked section ends with an MSC (]]), directly followed by an MDC (>). This MSC is recognized in the CON mode and the DSM mode.
Why is this? Let's assume we start in CON mode:
- Mode stack: CON (Content)
- MDO: <!
  - Mode stack: CON/MD (Markup declaration)
- DSO: [
  - Mode stack: CON/MD/DSM (Declaration subset OR Marked section)
- CDATA
  - Mode stack: CON/MD/DSM (Declaration subset OR Marked section)
- DSO: [ (Likely, this delimiter sets the mode to content.)
  - Mode stack: CON/MD/DSM/CON (Content)
- some data
  - Mode stack: CON/MD/DSM/CON (Content)
- MSC: ]]
  - Mode stack: CON/MD (Markup declaration)
- MDC: >
  - Mode stack: CON (Content)
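The walkthrough above can be replayed with a small stack simulation. This is a sketch of the assumed behavior only: in particular, the rule that the second DSO pushes CON is the guess made in this text, not something confirmed here, and the token kinds (NAME, DATA) and the function trace_modes are hypothetical.

```python
def trace_modes(tokens):
    """Return the mode stack (joined with '/') before the first token and after each one."""
    stack = ["CON"]
    trace = ["/".join(stack)]
    for kind, _text in tokens:
        if kind == "MDO":
            stack.append("MD")
        elif kind == "DSO":
            # First DSO (directly after MDO) enters DSM; the second DSO,
            # after the keyword, is ASSUMED to enter CON.
            stack.append("DSM" if stack[-1] == "MD" else "CON")
        elif kind == "MSC":
            # Closing the marked section drops everything above MD.
            while len(stack) > 1 and stack[-1] != "MD":
                stack.pop()
        elif kind == "MDC":
            stack.pop()
        # NAME and DATA tokens leave the stack unchanged.
        trace.append("/".join(stack))
    return trace


tokens = [
    ("MDO", "<!"), ("DSO", "["), ("NAME", "CDATA"), ("DSO", "["),
    ("DATA", "some data"), ("MSC", "]]"), ("MDC", ">"),
]
trace = trace_modes(tokens)
```

Each entry of trace corresponds to one "Mode stack" line of the walkthrough, in order.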
It is likely that the second DSO ([) of the marked section switches the mode from DSM to CON. Once a marked section is started, it must be closed by an MSC (]]), even when the second DSO ([) hasn't appeared yet. Because the first DSO ([) brings the mode to DSM, one would expect the MSC (]]) to be recognized only in DSM. Since it can also be recognized in CON, it is likely that the second DSO ([) brings the mode to CON.
There are 5 keywords described for marked sections:
- IGNORE
- INCLUDE
- CDATA
- RCDATA
- TEMP
CDATA and RCDATA recognize the first MSC that appears. CDATA recognizes nothing else; RCDATA additionally recognizes only references. IGNORE recognizes only other marked sections, in order to find the matching MSC (]]), because those can be nested. INCLUDE recognizes everything that is allowed.
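The difference between CDATA/RCDATA and IGNORE when looking for the closing delimiter can be sketched as follows. This is a simplified illustration that scans raw text for the literal strings <![ and ]]>; it ignores reference handling in RCDATA, and it does not cover INCLUDE or TEMP, which need a full parser. The function name find_section_end is hypothetical.

```python
MS_OPEN = "<!["   # MDO directly followed by DSO
MS_CLOSE = "]]>"  # MSC directly followed by MDC


def find_section_end(text: str, start: int, keyword: str) -> int:
    """Return the index of the ]]> closing a marked section whose content starts at start.

    Returns -1 if the section is never closed.
    """
    if keyword in ("CDATA", "RCDATA"):
        # The first MSC/MDC ends the section; nesting is not recognized.
        return text.find(MS_CLOSE, start)
    if keyword == "IGNORE":
        # Nested marked sections are tracked so that the matching close is found.
        depth = 1
        i = start
        while i < len(text):
            if text.startswith(MS_OPEN, i):
                depth += 1
                i += len(MS_OPEN)
            elif text.startswith(MS_CLOSE, i):
                depth -= 1
                if depth == 0:
                    return i
                i += len(MS_CLOSE)
            else:
                i += 1
        return -1
    raise ValueError(f"keyword {keyword!r} is not handled by this sketch")


text = "<![IGNORE[ a <![CDATA[ b ]]> c ]]> tail"
content_start = len("<![IGNORE[")
ignore_end = find_section_end(text, content_start, "IGNORE")
cdata_end = find_section_end(text, content_start, "CDATA")
```

With the same content, the IGNORE rule skips past the inner section's close, while the CDATA rule stops at the first close it meets.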