Note: The following spec is a draft and is open for discussion.
Content in Tahi is treated as data, and that data lives in several places. Figures and references are external resources stored in separate DB tables. The main content, however, is expressed in XML using a strict schema. Choosing XML lets us use existing toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts).
Goals:
- Pure data representation
- An interface and intermediate format for import/export toolchains (a target format for iHat)
- Optimized for automated processing
- More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links etc.)
- Ensure a unique id for every content element (ids allow mapping between different data formats, which makes creating toolchains much easier)
- Use HTML-compatible markup whenever possible (e.g. for style annotations)
- Easy to convert from and to HTML (e.g. when pasting from clipboard)
- Ability to generate NLM/JATS
- External resources are included by id; in portable formats they are expanded inline (e.g. Tahi HTML wrapped in an ePub container)
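The unique-id goal lends itself to an automated check. The following is a minimal Python sketch, not part of the spec: it assumes ids are carried in an `id` attribute and references in an `rid` attribute, as in the example on this page, and the function names `duplicate_ids` and `dangling_rids` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the sorted id values that occur on more than one element."""
    counts = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def dangling_rids(root):
    """Return the sorted rid values that do not match any id in the document."""
    known = {el.get("id") for el in root.iter() if el.get("id") is not None}
    used = {el.get("rid") for el in root.iter() if el.get("rid") is not None}
    return sorted(used - known)


if __name__ == "__main__":
    # Hypothetical test document: one duplicated id and one unresolved rid.
    doc = ET.fromstring(
        '<article><body>'
        '<p id="p1">See <ref type="bib" rid="bib1">Doe, 2010</ref>.</p>'
        '<p id="p1">Duplicate id.</p>'
        '<include type="fig" rid="fig1"/>'
        '</body><resources><bib id="bib1"/></resources></article>'
    )
    print(duplicate_ids(doc))  # ['p1']
    print(dangling_rids(doc))  # ['fig1']
```

`dangling_rids` is only meaningful on a populated document, since `<resources>` is empty in the stored source.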
Example:
<article>
<meta>
<title>The Tahi Article Format</title>
<abstract>Article abstract that can be <strong>annotated</strong></abstract>
<doi>10.1234/myJournal.00001</doi>
<status code="in-review">In review</status>
<creator>userx</creator>
<created-at>2015-05-12T08:20:19.657Z</created-at>
<updated-at>2015-05-13T04:18:42.657Z</updated-at>
</meta>
<body>
<h2 id="h1">A heading</h2>
<p id="p1">
Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
</p>
<!-- external content is included by referencing the external id -->
<include type="fig" rid="fig1"/>
</body>
<!-- Resources are owned externally and are extracted from the DB. The resources
element is therefore empty in the source file and is populated on the fly (e.g. each
time a doc is opened in the editor). This avoids storing content redundantly. -->
<resources>
<fig id="fig1">
<doi>10.1234/myJournal.00001.001</doi>
<label>Figure 1.</label>
<title>Figure title</title>
<caption>
<em>Annotated</em> Figure caption
</caption>
<image src="fig1.png"/>
</fig>
<bib id="bib1" bib-type="journal-article">
<contributors>
<author id="author1">
<surname>Doe</surname>
<given-names>John</given-names>
</author>
</contributors>
<year>2010</year>
<title>Article <em>X</em></title>
<source>Journal Y</source>
<volume>1</volume>
<fpage>40</fpage>
<lpage>45</lpage>
<doi>10.1234/myJournal.00005</doi>
</bib>
</resources>
</article>
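The comment in the example describes `<resources>` as empty in storage and populated when the document is opened. A minimal Python sketch of that step follows. It assumes a hypothetical `lookup(type, rid)` callable standing in for the DB query, since the real storage API is not specified here; `populate_resources` is likewise an illustrative name.

```python
import xml.etree.ElementTree as ET


def populate_resources(article, lookup):
    """Fill <resources> with one element per distinct rid used in <body>.

    `lookup` is a hypothetical callable (type, rid) -> Element that stands
    in for fetching a figure or reference from the DB.
    """
    resources = article.find("resources")
    if resources is None:
        resources = ET.SubElement(article, "resources")
    # Start from an empty container so stale copies are never kept.
    for child in list(resources):
        resources.remove(child)

    seen = set()
    for el in article.find("body").iter():
        rid = el.get("rid")
        if rid is None or rid in seen:
            continue
        seen.add(rid)
        resources.append(lookup(el.get("type"), rid))
    return article


if __name__ == "__main__":
    source = ET.fromstring(
        '<article><body>'
        '<p id="p1">Some content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>'
        '<include type="fig" rid="fig1"/>'
        '</body><resources/></article>'
    )

    def fake_lookup(kind, rid):
        # Placeholder for a DB query: returns an empty <fig> or <bib> element.
        return ET.Element(kind, {"id": rid})

    populate_resources(source, fake_lookup)
    print(ET.tostring(source.find("resources"), encoding="unicode"))
```

Resources are added in the order they are first referenced, and a resource cited several times is included only once.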
The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).
Goals:
- Tahi HTML is a generated view of the content rather than an editable source format
- It conforms to a Tahi HTML spec (WIP), so tools can rely on predictable markup
- A different citation style results in a different HTML presentation
- Considered a portable and CSS-styleable representation of a Tahi article
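To illustrate how a citation style changes the generated presentation, here is a small Python sketch that renders one `<bib>` resource as a reference string. The `numbered` layout mirrors the reference entry in the HTML example on this page; the `author-year` layout and the names `format_reference` and `text_of` are hypothetical and only show that the same data can yield a different output.

```python
import xml.etree.ElementTree as ET


def text_of(el):
    """Return the text of an element with inline annotations stripped."""
    return "".join(el.itertext()).strip()


def format_reference(bib, style="numbered"):
    """Render a <bib> element as plain text in the given (illustrative) style."""
    authors = ", ".join(
        f"{a.findtext('surname')} {a.findtext('given-names')}"
        for a in bib.findall("contributors/author")
    )
    year = bib.findtext("year")
    title = text_of(bib.find("title"))
    source = bib.findtext("source")
    volume = bib.findtext("volume")
    pages = f"{bib.findtext('fpage')}-{bib.findtext('lpage')}"

    if style == "numbered":
        # Layout used in the HTML example on this page.
        return f"{authors} ({year}) {title}. {source} {volume}: {pages}."
    if style == "author-year":
        # Hypothetical alternative layout, for contrast only.
        return f"{authors}. {year}. {title}. {source}, {volume}, {pages}."
    raise ValueError(f"unknown style: {style}")


if __name__ == "__main__":
    bib = ET.fromstring(
        '<bib id="bib1" bib-type="journal-article">'
        '<contributors><author id="author1">'
        '<surname>Doe</surname><given-names>John</given-names>'
        '</author></contributors>'
        '<year>2010</year><title>Article <em>X</em></title>'
        '<source>Journal Y</source><volume>1</volume>'
        '<fpage>40</fpage><lpage>45</lpage></bib>'
    )
    print(format_reference(bib))
    print(format_reference(bib, "author-year"))
```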
Example:
<html>
<head>
<title>The Tahi Article Format</title>
</head>
<body>
<div class="section front-matter">
<h1 class="title">The Tahi Article Format</h1>
<div class="authors">
<div class="author">John Doe</div>
</div>
<div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
<div class="abstract">Article abstract that can be <strong>annotated</strong></div>
</div>
<div class="section main-text">
<h2 id="h1">A heading</h2>
<p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
<!-- Expanded figure -->
<figure id="fig1">
<img src="fig1.png">
<figcaption><span class="label">Figure 1.</span> <em>Annotated</em> Figure caption</figcaption>
</figure>
</div>
<div class="section references">
<h2>References</h2>
<div class="reference" id="bib1">
<span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
</div>
</div>
</body>
</html>
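How the Tahi XML maps onto this HTML can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual generator: `body_to_html`, `convert_block`, and `render_figure` are hypothetical names, only `<ref>` and figure `<include>` elements are handled, every figure is assumed to have `<label>`, `<caption>`, and `<image>`, and front matter and the reference list are left out.

```python
import copy
import xml.etree.ElementTree as ET


def convert_block(node):
    """Copy a body element and turn each <ref rid="x"> into <a href="#x">."""
    html = copy.deepcopy(node)
    html.tail = None
    for ref in list(html.iter("ref")):
        rid = ref.get("rid")
        ref.attrib.clear()
        ref.set("href", "#" + rid)
        ref.tag = "a"
    return html


def render_figure(fig):
    """Expand a <fig> resource into <figure> with <img> and <figcaption>."""
    figure = ET.Element("figure", {"id": fig.get("id")})
    ET.SubElement(figure, "img", {"src": fig.find("image").get("src")})
    caption = copy.deepcopy(fig.find("caption"))
    caption.tag = "figcaption"
    caption.tail = None
    label = ET.Element("span", {"class": "label"})
    label.text = fig.findtext("label")
    # Text that opened the caption now follows the label, after one space.
    label.tail = " " + (caption.text or "").lstrip()
    caption.text = None
    caption.insert(0, label)
    figure.append(caption)
    return figure


def body_to_html(article):
    """Return the main-text <div> for a populated Tahi XML article."""
    figs = {f.get("id"): f for f in article.findall("resources/fig")}
    out = ET.Element("div", {"class": "section main-text"})
    for node in article.find("body"):
        if node.tag == "include" and node.get("type") == "fig":
            out.append(render_figure(figs[node.get("rid")]))
        else:
            out.append(convert_block(node))
    return out


if __name__ == "__main__":
    article = ET.fromstring(
        '<article><body>'
        '<h2 id="h1">A heading</h2>'
        '<p id="p1">Some <em>annotated</em> content '
        '(<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>'
        '<include type="fig" rid="fig1"/>'
        '</body><resources><fig id="fig1"><label>Figure 1.</label>'
        '<caption><em>Annotated</em> Figure caption</caption>'
        '<image src="fig1.png"/></fig></resources></article>'
    )
    print(ET.tostring(body_to_html(article), encoding="unicode", method="html"))
```

Because the body already uses HTML-compatible tags (`h2`, `p`, `em`, `strong`), most elements are copied through unchanged and only references and includes need rewriting.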
@michael: this makes sense to me. And I am a big fan of the Pandoc AST :). But it means that there is an internal Tahi format different from the HTML output, which is conceptually different from using a single HTML format all the way through.