Note: The following spec is considered a draft and open for discussion.
Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can use existing toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).
Goals:
- Pure data representation
- An interface and intermediate format for import/export toolchains (a target format for iHat)
- Optimized for automated processing
- More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links etc.)
- Ensure a unique id for every content element (allows mapping between different data formats, which makes creating toolchains much easier)
- Use HTML-compatible markup whenever possible (e.g. for style annotations)
- Easy to convert from and to HTML (e.g. when pasting from clipboard)
- Ability to generate NLM/JATS
- External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)
Example:
<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
       resources element is empty in the source file and gets populated on the fly (e.g. each
       time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>
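To make the "unique id" and "included by id" goals concrete, here is a minimal Python sketch of a consistency check a server-side maintenance script might run. It is not part of the spec: the function name `check_ids`, the abbreviated sample document, and the rule that only direct children of `<body>` (other than `<include>`) must carry an id are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a Tahi XML document with populated resources.
TAHI_XML = """
<article>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">Some content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1"><title>Figure title</title></fig>
    <bib id="bib1"><year>2010</year></bib>
  </resources>
</article>
"""

def check_ids(xml_text):
    """Return a list of problems; an empty list means all checks passed."""
    root = ET.fromstring(xml_text)
    problems = []

    # Every id that is present must be unique across the whole document.
    seen = set()
    for el in root.iter():
        el_id = el.get("id")
        if el_id is None:
            continue
        if el_id in seen:
            problems.append(f"duplicate id: {el_id}")
        seen.add(el_id)

    # Assumption: direct children of <body> are the "content elements".
    # <include> is exempt because it only points at a resource.
    body = root.find("body")
    for el in (body if body is not None else []):
        if el.tag != "include" and el.get("id") is None:
            problems.append(f"<{el.tag}> in body has no id")

    # Every rid (on <ref> or <include>) must resolve to a resource id.
    resources = root.find("resources")
    resource_ids = {
        el.get("id") for el in (resources if resources is not None else [])
    }
    for el in root.iter():
        rid = el.get("rid")
        if rid is not None and rid not in resource_ids:
            problems.append(f"unresolved rid: {rid}")

    return problems

print(check_ids(TAHI_XML))  # prints []
```

A check like this only makes sense after the `<resources>` element has been populated from the DB, since the stored source file keeps it empty.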
The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).
Goals:
- Tahi HTML is a generated view on the content rather than an editable source format
- It corresponds to a Tahi HTML spec (WIP) so tools can rely on a certain markup
- A different citation style results in a different HTML presentation
- Considered a portable and CSS-styleable representation of a Tahi article
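As a rough illustration of the citation-style goal, the Python sketch below renders the same `<bib>` resource in two ways. Both formatters are hypothetical: `author_year_style` reproduces the reference line used in this spec's HTML example, while `compact_style` is an invented alternative that only shows how the output would change. Neither is meant as an implementation of a real citation style.

```python
import xml.etree.ElementTree as ET

# The <bib> resource from the Tahi XML example (DOI omitted for brevity).
BIB_XML = """<bib id="bib1" bib-type="journal-article">
  <contributors>
    <author id="author1"><surname>Doe</surname><given-names>John</given-names></author>
  </contributors>
  <year>2010</year>
  <title>Article <em>X</em></title>
  <source>Journal Y</source>
  <volume>1</volume>
  <fpage>40</fpage>
  <lpage>45</lpage>
</bib>"""

def fields(bib):
    """Extract the plain-text fields a citation formatter needs."""
    authors = [
        (a.findtext("surname"), a.findtext("given-names"))
        for a in bib.findall("contributors/author")
    ]
    return {
        "authors": authors,
        "year": bib.findtext("year"),
        # itertext() flattens inline markup such as <em> to plain text.
        "title": "".join(bib.find("title").itertext()),
        "source": bib.findtext("source"),
        "volume": bib.findtext("volume"),
        "pages": f'{bib.findtext("fpage")}-{bib.findtext("lpage")}',
    }

def author_year_style(f):
    # Hypothetical style matching the reference line in the HTML example.
    names = ", ".join(f"{s} {g}" for s, g in f["authors"])
    return f'{names} ({f["year"]}) {f["title"]}. {f["source"]} {f["volume"]}: {f["pages"]}.'

def compact_style(f):
    # Hypothetical alternative: initials only, year after the source.
    names = ", ".join(f"{s}, {g[0]}." for s, g in f["authors"])
    return f'{names} {f["title"]}. {f["source"]}. {f["year"]};{f["volume"]}:{f["pages"]}.'

f = fields(ET.fromstring(BIB_XML))
print(author_year_style(f))  # Doe John (2010) Article X. Journal Y 1: 40-45.
print(compact_style(f))      # Doe, J. Article X. Journal Y. 2010;1:40-45.
```

The source data stays identical in both cases; only the generated presentation differs.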
Example:
<html>
  <head>
    <title>The Tahi Article Format</title>
  </head>
  <body>
    <div class="section front-matter">
      <h1 class="title">The Tahi Article Format</h1>
      <div class="authors">
        <div class="author">John Doe</div>
      </div>
      <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
      <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
    </div>
    <div class="section main-text">
      <h2 id="h1">A heading</h2>
      <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
      <!-- Expanded figure -->
      <figure id="fig1">
        <img src="fig1.png">
        <figcaption><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
      </figure>
    </div>
    <div class="section references">
      <h2>References</h2>
      <div class="reference">
        <div class="reference" id="bib1">
          <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
        </div>
      </div>
    </div>
  </body>
</html>
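To give a feel for how the HTML view could be generated from the XML source, here is a minimal Python sketch that expands `<include type="fig">` into a `<figure>` and turns `<ref>` into an anchor link. The function names `render_figure` and `body_to_html` and the abbreviated input are assumptions for illustration. It only covers the main-text section and passes the figure label through unchanged.

```python
import xml.etree.ElementTree as ET

# Abbreviated Tahi XML: one paragraph with a citation and one included figure.
TAHI_XML = """<article>
  <body>
    <p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1">
      <label>Figure 1.</label>
      <caption><em>Annotated</em> Figure caption</caption>
      <image src="fig1.png"/>
    </fig>
  </resources>
</article>"""

def render_figure(fig):
    """Build an HTML <figure> element from a Tahi <fig> resource."""
    figure = ET.Element("figure", {"id": fig.get("id")})
    ET.SubElement(figure, "img", {"src": fig.find("image").get("src")})
    figcaption = ET.SubElement(figure, "figcaption")
    label = ET.SubElement(figcaption, "span", {"class": "label"})
    label.text = fig.findtext("label")
    # Carry over the caption text and its inline markup (e.g. <em>) as-is.
    caption = fig.find("caption")
    label.tail = caption.text
    figcaption.extend(list(caption))
    return figure

def body_to_html(xml_text):
    """Render the <body> of a Tahi XML document as the main-text HTML section."""
    root = ET.fromstring(xml_text)
    figs = {f.get("id"): f for f in root.findall("resources/fig")}
    main = ET.Element("div", {"class": "section main-text"})
    for el in root.find("body"):
        if el.tag == "include" and el.get("type") == "fig":
            # Expand the include by looking up the resource by id.
            main.append(render_figure(figs[el.get("rid")]))
        else:
            main.append(el)
    # Inline citations become links to the matching reference entry.
    for ref in list(main.iter("ref")):
        rid = ref.get("rid")
        ref.tag = "a"
        ref.attrib.clear()
        ref.set("href", "#" + rid)
    return ET.tostring(main, encoding="unicode", method="html")

print(body_to_html(TAHI_XML))
```

Because elements that already use HTML-compatible names (`p`, `em`, `h2`) pass through untouched, the conversion only has to handle the few Tahi-specific elements.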
Looks clean to me. I'm still not entirely clear, though, why we need the XML format (1) when it can also be expressed, without loss, in the HTML format (2), i.e. why use XML when we can represent exactly the same thing in HTML (as demonstrated)?
The only advantage I see in the list of requirements that points to XML is the first item:
Pure data representation
That 'leads' to XML (supposedly). All the other points are things we can do with the article once it is HTML. So it raises the question for me: what is the advantage of XML here? Currently I don't see it, especially since there are good tools for HTML validation, and if we write good code for constructing the HTML, it is sure to conform to our needs. Adding XML actually adds complexity by requiring an XML<->HTML conversion, which feels unnecessary and is potentially error-prone.
I do think the structure is more readable in (1), however, which is perhaps ironic as that is the machine-readable version ;) How about just using HTML and exploring ways to make the structure cleaner?