This is (yet) another markdown specification. The purpose of this particular markdown spec is to provide a single unambiguous markdown specification. It aims to use formal regular expressions so that implementors can quickly create implementations (given a regex library) and the results will be consistent across platforms. We specify a very small subset of the markdown seen in the wild.
Version 0.1
Character Encoding: UTF-8. We do not support the conversion of HTML entities to symbols.
Roughly speaking, the markdown supports basic formatting (bold, italic, underline), inline links, images, and bullet lists. It also has multiple levels of section headers. We don't support tables or nested bullet lists.
The document is processed line by line. Lines are classified by type and processed individually before being joined into a single formatted output. The joining process turns consecutive bullet list lines into a single bulleted list and consecutive blockquote lines into a single quoted message. It also makes code blocks from [code block, text, text, ..., code block] lines.
Line types: empty, header, thematic break, bullet point, blockquote, codeblock, paragraph. Every line should be classified by line type, then further parsed. Paragraph lines can contain formatting (bold, italic, underline, strikethrough), links and images.
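The joining step could be sketched roughly as follows. This is a minimal sketch, not a normative algorithm: it assumes lines have already been classified into (type, text) pairs, the function name `join_lines` and the type strings are hypothetical, and code blocks are left out to keep it short.

```python
from itertools import groupby

# Line types whose consecutive lines are merged into one block (assumed names).
MERGED_TYPES = {"bullet", "blockquote"}

def join_lines(classified):
    """Group consecutive (line_type, text) pairs into (line_type, [texts]) blocks."""
    blocks = []
    for line_type, group in groupby(classified, key=lambda pair: pair[0]):
        texts = [text for _, text in group]
        if line_type in MERGED_TYPES:
            # Consecutive bullet or blockquote lines become a single block.
            blocks.append((line_type, texts))
        else:
            # Assumption: every other line stays a block of its own.
            blocks.extend((line_type, [text]) for text in texts)
    return blocks
```

Keeping other line types unmerged is an assumption of this sketch; the text above only requires merging for bullet lists, blockquotes and code blocks.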
- Empty
Empty line.
- Headers
A header line is one or more # then a space then a nonempty string.
([#]+) (.+)
capture 1 is the header level. capture 2 is the header name.
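As an illustration, the header pattern can be applied with Python's `re` module. This is a sketch under two assumptions: the pattern must match the whole line, and the header level is the number of # characters in capture 1. The helper name `parse_header` is hypothetical.

```python
import re

# The header pattern from the spec, matched against the whole line.
HEADER_RE = re.compile(r"([#]+) (.+)")

def parse_header(line):
    """Return (level, name) for a header line, or None for any other line."""
    match = HEADER_RE.fullmatch(line)
    if match is None:
        return None
    # Assumption: the level is the count of '#' characters in capture 1.
    return len(match.group(1)), match.group(2)
```

For example, `parse_header("## Setup")` gives `(2, "Setup")`, while `parse_header("#NoSpace")` gives `None` because the space after the # characters is required.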
- Thematic break
A line that's just dashes:
[-]+
This is similar to the `<hr>` tag in HTML. It's just a horizontal split or line.
- Bullet Points
A bullet point line is one starting with a *.
[*] (.+)
capture 1 is the bullet text. It will be further processed to support all the formatting and hyperlinks, but not images.
Note: we don't support anything like nested bulleted lists of bulleted lists.
- Blockquote
Lines beginning with >
> (.+)
These lines do not have formatting.
XXX Should we support nesting of blockquotes, like email formatting? I think we probably should.
- Code block
A line that starts with ```.
The idea is that any text between two code block lines will be presented in a fixed width font, like program code. The lines between code block lines will not be processed or formatted.
An optional extra for this feature is that a starting code block line might specify the language, e.g. "```java" (github does this).
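The fence handling described here might look like the sketch below. The function name `split_code_blocks` and the tuple layout are hypothetical, and treating an unterminated block as running to the end of the document is an assumption the text does not settle.

```python
def split_code_blocks(lines):
    """Return ("code", language, [lines]) and ("text", None, line) items."""
    items = []
    in_code = False
    language = None
    body = []
    for line in lines:
        if line.startswith("```"):
            if in_code:
                # Closing code block line: emit the collected block.
                items.append(("code", language, body))
                in_code, language, body = False, None, []
            else:
                # Opening code block line: anything after the fence is
                # taken as the optional language specifier.
                in_code = True
                language = line[3:].strip() or None
        elif in_code:
            body.append(line)  # kept verbatim, never classified or formatted
        else:
            items.append(("text", None, line))
    if in_code:
        # Assumption: an unterminated block extends to the end of the input.
        items.append(("code", language, body))
    return items
```

Lines inside a block are stored untouched, so something like "# not a header" between two fences is never treated as a header.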
- Paragraph
Any other kind of line is a paragraph line.
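Putting the line types together, classification can be sketched as an ordered list of patterns with paragraph as the fallback. Several details here are assumptions rather than part of the spec: the order in which patterns are tried, matching against the whole line, the blockquote pattern `> (.+)` (modeled on the header and bullet patterns), and the name `classify`. The bullet pattern writes the literal * as `[*]` so that a regex engine does not read it as a repetition operator. The sketch also ignores the fact that lines between two code block lines must not be classified at all.

```python
import re

# Tried in order; the first pattern that matches the whole line wins.
LINE_PATTERNS = [
    ("empty", re.compile(r"")),
    ("header", re.compile(r"([#]+) (.+)")),
    ("thematic break", re.compile(r"[-]+")),
    ("bullet point", re.compile(r"[*] (.+)")),
    ("blockquote", re.compile(r"> (.+)")),  # assumed pattern
    ("codeblock", re.compile(r"```.*")),
]

def classify(line):
    """Return the line type name for a single source line."""
    for line_type, pattern in LINE_PATTERNS:
        if pattern.fullmatch(line):
            return line_type
    # Any other kind of line is a paragraph line.
    return "paragraph"
```

With these patterns a line such as "#hashtag" or "*emphasis*" falls through to paragraph, because the header and bullet patterns both require a space after the marker.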
markup control codes:
text italic
text bold
text strikethrough
text underline
text
code
these may be nested and combined. repeating a tag has no effect.
hyperlinks (link) or name imagelinks  footnotes/citation links ^[text]. These are presented as small numbered links next to the text, then collected and all listed at the bottom of the page. They can contain links.
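Footnote collection could be sketched as below. This is only an illustration: the name `collect_footnotes` and the `[n]` marker format are hypothetical, and the pattern assumes the footnote text contains no `]` character, so footnotes that themselves contain bracketed link syntax would need a real parser.

```python
import re

# ^[text], assuming the footnote text itself contains no ']'.
FOOTNOTE_RE = re.compile(r"\^\[([^\]]+)\]")

def collect_footnotes(text):
    """Replace each ^[text] with a numbered marker; return (new_text, notes)."""
    notes = []

    def replace(match):
        notes.append(match.group(1))
        # The marker number is the 1-based position in the collected list.
        return f"[{len(notes)}]"

    return FOOTNOTE_RE.sub(replace, text), notes
```

The returned list holds the footnote texts in order of appearance, ready to be listed at the bottom of the page.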
In order to be able to write text including uninterpreted control codes we have backslash escaping. "\\" in source turns into "\" in the result. "\x" in source turns into "x".
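The textual effect of backslash escaping can be sketched as follows. The name `unescape` is hypothetical, and keeping a lone trailing backslash as-is is an assumption, since that case is not covered here. A real parser would also have to remember which characters were escaped so that they are not later read as control codes; this sketch only shows the character mapping.

```python
def unescape(source):
    """Drop each escaping backslash and keep the character that follows it."""
    result = []
    chars = iter(source)
    for ch in chars:
        if ch == "\\":
            # Consume the next character and keep it verbatim. A trailing
            # lone backslash is kept unchanged (assumption).
            result.append(next(chars, "\\"))
        else:
            result.append(ch)
    return "".join(result)
```

Under this rule a doubled backslash in the source comes out as a single backslash, and a backslash before any other character simply disappears.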
XXX what to do about invalid formatting? XXX what about __this type of thing__
Implementations MUST implement all line types and formatting options.
Implementations MUST support 4 levels of headers.
Implementations MAY display images or MAY just display the filename of the image in a box or such.
Implementations MAY produce a table of contents at the start of a page based on the headings.
Implementations MAY perform syntax coloring for code blocks based on the language specifier from the starting code block line.
Implementations MAY hide or not hide the formatting symbols.
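The optional table of contents mentioned above could be built from the header lines alone. This is a sketch under assumptions: the function name `table_of_contents` is hypothetical, entries are indented two spaces per level purely for illustration, and lines between code block lines are skipped so that a "# comment" inside program code is not mistaken for a header.

```python
import re

HEADER_RE = re.compile(r"([#]+) (.+)")

def table_of_contents(lines):
    """Return one indented entry per header line, in document order."""
    entries = []
    in_code = False
    for line in lines:
        if line.startswith("```"):
            # Code block lines toggle verbatim mode and are never headers.
            in_code = not in_code
            continue
        if in_code:
            continue
        match = HEADER_RE.fullmatch(line)
        if match:
            level = len(match.group(1))
            entries.append("  " * (level - 1) + match.group(2))
    return entries
```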