Note: The following spec is considered a draft and open for discussion.
Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can use existing toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).
Goals:
- Pure data representation
- An interface and intermediate format for import/export toolchains (a target format for iHat)
- Optimized for automated processing
- More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links, etc.)
- Ensure a unique id for every content element (this allows mapping between different data formats, which makes creating toolchains much easier)
- Use HTML-compatible markup whenever possible (e.g. for style annotations)
- Easy to convert from and to HTML (e.g. when pasting from clipboard)
- Ability to generate NLM/JATS
- External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)
Example:
<article>
<meta>
<title>The Tahi Article Format</title>
<abstract>Article abstract that can be <strong>annotated</strong></abstract>
<doi>10.1234/myJournal.00001</doi>
<status code="in-review">In review</status>
<creator>userx</creator>
<created-at>2015-05-12T08:20:19.657Z</created-at>
<updated-at>2015-05-13T04:18:42.657Z</updated-at>
</meta>
<body>
<h2 id="h1">A heading</h2>
<p id="p1">
Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
</p>
<!-- external content is included by referencing the external id -->
<include type="fig" rid="fig1"/>
</body>
<!-- Resources have external ownership and are extracted from the DB. That means the
resources element is empty in the source file and gets populated on the fly (e.g. each
time when a doc is opened in the editor). That way we avoid storing content redundantly -->
<resources>
<fig id="fig1">
<doi>10.1234/myJournal.00001.001</doi>
<label>Figure 1.</label>
<title>Figure title</title>
<caption>
<em>Annotated</em> Figure caption
</caption>
<image src="fig1.png"/>
</fig>
<bib id="bib1" bib-type="journal-article">
<contributors>
<author id="author1">
<surname>Doe</surname>
<given-names>John</given-names>
</author>
</contributors>
<year>2010</year>
<title>Article <em>X</em></title>
<source>Journal Y</source>
<volume>1</volume>
<fpage>40</fpage>
<lpage>45</lpage>
<doi>10.1234/myJournal.00005</doi>
</bib>
</resources>
</article>
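The unique-id goal and the include-by-id mechanism can be checked mechanically. The following is a minimal sketch in Python, not part of the spec: it assumes that resources have already been populated, that every `rid` must match the `id` of some element in the same document, and the helper name `check_ids` is hypothetical. The embedded XML is a trimmed-down copy of the example above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed-down version of the example article (resources already populated).
ARTICLE_XML = """
<article>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1"><label>Figure 1.</label></fig>
    <bib id="bib1" bib-type="journal-article"><year>2010</year></bib>
  </resources>
</article>
"""


def check_ids(root):
    """Return (duplicate_ids, dangling_rids) for a parsed article.

    Hypothetical helper: the rule that every rid must resolve to an id
    in the same document is an assumption, not something the draft states.
    """
    # Count how often each id occurs anywhere in the document.
    ids = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    duplicates = sorted(i for i, n in ids.items() if n > 1)
    # Collect rid values that do not point at any known id.
    dangling = sorted({
        el.get("rid")
        for el in root.iter()
        if el.get("rid") is not None and el.get("rid") not in ids
    })
    return duplicates, dangling


root = ET.fromstring(ARTICLE_XML)
duplicates, dangling = check_ids(root)
print("duplicate ids:", duplicates)
print("dangling rids:", dangling)
```

A check like this could run in a server-side maintenance script before a document is handed to an import/export toolchain.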
The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).
Goals:
- Tahi HTML is a generated view on the content rather than an editable source format
- It corresponds to a Tahi HTML spec (WIP) so tools can rely on a certain markup
- A different citation style results in a different HTML presentation
- Considered a portable and CSS-styleable representation of a Tahi article
Example:
<html>
<head>
<title>The Tahi Article Format</title>
</head>
<body>
<div class="section front-matter">
<h1 class="title">The Tahi Article Format</h1>
<div class="authors">
<div class="author">John Doe</div>
</div>
<div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
<div class="abstract">Article abstract that can be <strong>annotated</strong></div>
</div>
<div class="section main-text">
<h2 id="h1">A heading</h2>
<p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
<!-- Expanded figure -->
<figure id="fig1">
<img src="fig1.png">
<figcaption><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
</figure>
</div>
<div class="section references">
<h2>References</h2>
<div class="reference">
<div class="reference" id="bib1">
<span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
</div>
</div>
</div>
</body>
</html>
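To illustrate how the HTML view could be generated from the XML source, here is a minimal Python sketch covering only the main text: `<ref>` elements become in-page links and `<include type="fig">` is expanded into a `<figure>`. The function names (`expand_figure`, `body_to_html`) are hypothetical, the mapping is inferred from the two examples above rather than from a finished Tahi HTML spec, and front matter and the reference list are left out.

```python
import copy
import xml.etree.ElementTree as ET

# Reduced copy of the XML example, with the resources element populated.
ARTICLE_XML = """<article>
<body>
<h2 id="h1">A heading</h2>
<p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
<include type="fig" rid="fig1"/>
</body>
<resources>
<fig id="fig1">
<label>Figure 1.</label>
<caption><em>Annotated</em> Figure caption</caption>
<image src="fig1.png"/>
</fig>
</resources>
</article>"""


def expand_figure(fig):
    """Build an HTML <figure> element from a <fig> resource (assumed mapping)."""
    figure = ET.Element("figure", {"id": fig.get("id")})
    ET.SubElement(figure, "img", {"src": fig.find("image").get("src")})
    # Reuse the annotated caption content and put the label in front of it.
    caption = copy.deepcopy(fig.find("caption"))
    caption.tag = "figcaption"
    label = ET.Element("span", {"class": "label"})
    label.text = fig.findtext("label")
    label.tail = " " + (caption.text or "").lstrip()
    caption.text = None
    caption.insert(0, label)
    figure.append(caption)
    return figure


def body_to_html(article):
    """Convert the <body> of a Tahi article into the main-text <div>."""
    resources = {el.get("id"): el for el in article.find("resources")}
    main = ET.Element("div", {"class": "section main-text"})
    for child in article.find("body"):
        if child.tag == "include":
            # External content is expanded in the portable format.
            if child.get("type") == "fig":
                main.append(expand_figure(resources[child.get("rid")]))
            continue
        node = copy.deepcopy(child)
        # Turn <ref rid="..."> into an in-page link to the target id.
        for ref in list(node.iter("ref")):
            rid = ref.get("rid")
            ref.tag = "a"
            ref.attrib.clear()
            ref.set("href", "#" + rid)
        main.append(node)
    return main


article = ET.fromstring(ARTICLE_XML)
html = ET.tostring(body_to_html(article), encoding="unicode", method="html")
print(html)
```

Because ids are carried over unchanged (`h1`, `p1`, `fig1`), elements in the generated HTML can still be mapped back to their XML source.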
It actually cannot be expressed in (2) without loss, nor without redundancy. Here are some examples:
- A <title> element in the head and an <h1> that is actually rendered.
Hehe, I guess you read like a machine then. ;) Well, the important thing is that we have a pure data source without redundancy and excluding externally owned pieces. IMO this raw data does not need to be displayed, just as you would not display the contents of a raw database row to the user.
I'm 100% sure that we need to draw a distinction between data and presentation, as Riz and Oliver confirmed. While it's pretty clear to us what the presentation format will look like, we still need to define the source format. For modelling data, XML just came too naturally.
For comparison, I've modelled the source format in HTML: