Note: The following spec is considered a draft and open for discussion.
Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can utilize toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).
Goals:
- Pure data representation
- An interface and intermediate format for import/export toolchains (a target format for iHat)
- Optimized for automated processing
- More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links, etc.)
- Ensure a unique id for every content element (this allows mapping between different data formats and makes creating toolchains much easier)
- Use HTML-compatible markup whenever possible (e.g. for style annotations)
- Easy to convert from and to HTML (e.g. when pasting from clipboard)
- Ability to generate NLM/JATS
- External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)
Example:
<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
       resources element is empty in the source file and gets populated on the fly (e.g. each
       time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>
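The unique-id goal and the rid-based references in the example lend themselves to an automated check. The following is only a minimal sketch, written in Python's standard library for illustration; the function name `check_ids` and the trimmed sample document are assumptions, not part of the Tahi spec.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example article; only the parts the check needs.
TAHI_XML = """
<article>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1"><label>Figure 1.</label></fig>
    <bib id="bib1" bib-type="journal-article"><year>2010</year></bib>
  </resources>
</article>
"""


def check_ids(xml_text):
    """Return (duplicate_ids, dangling_rids) for a Tahi XML document."""
    root = ET.fromstring(xml_text)

    # Collect every id attribute and remember the ones seen more than once.
    seen, duplicates = set(), set()
    for el in root.iter():
        el_id = el.get("id")
        if el_id is None:
            continue
        if el_id in seen:
            duplicates.add(el_id)
        seen.add(el_id)

    # Every rid on <ref> or <include> should point at an existing id.
    dangling = set()
    for el in root.iter():
        if el.tag in ("ref", "include"):
            rid = el.get("rid")
            if rid is None or rid not in seen:
                dangling.add(rid or "(missing rid)")

    return sorted(duplicates), sorted(dangling)


duplicates, dangling = check_ids(TAHI_XML)
print("duplicate ids:", duplicates)
print("dangling rids:", dangling)
```

A check like this could run as one of the server-side maintenance scripts mentioned above, once the resources element has been populated from the DB.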
The Tahi HTML format is a complete and human-readable version of the Tahi article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).
Goals:
- Tahi HTML is a generated view on the content rather than an editable source format
- It corresponds to a Tahi HTML spec (WIP) so tools can rely on a certain markup
- A different citation style results in a different HTML presentation
- Considered a portable and CSS-styleable representation of a Tahi article
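To illustrate how a citation style could drive the generated HTML, here is a small sketch in Python. The numbered style mirrors the reference string used in the Tahi HTML example; the author-year style, the function names, and the dict layout are hypothetical and exist only for contrast.

```python
from html import escape

# Field names mirror the <bib> element of the XML example; the dict layout
# itself is an assumption made for this sketch.
BIB = {
    "id": "bib1", "surname": "Doe", "given_names": "John", "year": "2010",
    "title": "Article X", "source": "Journal Y",
    "volume": "1", "fpage": "40", "lpage": "45",
}


def format_numbered(bib, number):
    """Numbered style modeled on the reference string in the HTML example."""
    return (f"{number}. {bib['surname']} {bib['given_names']} ({bib['year']}) "
            f"{bib['title']}. {bib['source']} {bib['volume']}: "
            f"{bib['fpage']}-{bib['lpage']}.")


def format_author_year(bib):
    """Hypothetical author-year style, invented here for contrast."""
    initial = bib["given_names"][0]
    return (f"{bib['surname']}, {initial}. ({bib['year']}). {bib['title']}. "
            f"{bib['source']}, {bib['volume']}, {bib['fpage']}-{bib['lpage']}.")


def render_reference(bib, text):
    """Wrap a formatted reference in the div used by the HTML example."""
    return f'<div class="reference" id="{escape(bib["id"])}">{escape(text)}</div>'


print(render_reference(BIB, format_numbered(BIB, 1)))
print(render_reference(BIB, format_author_year(BIB)))
```

The same bib record yields two different HTML presentations, while the XML source stays untouched.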
Example:
<html>
  <head>
    <title>The Tahi Article Format</title>
  </head>
  <body>
    <div class="section front-matter">
      <h1 class="title">The Tahi Article Format</h1>
      <div class="authors">
        <div class="author">John Doe</div>
      </div>
      <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
      <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
    </div>
    <div class="section main-text">
      <h2 id="h1">A heading</h2>
      <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
      <!-- Expanded figure -->
      <figure id="fig1">
        <img src="fig1.png">
        <figcaption><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
      </figure>
    </div>
    <div class="section references">
      <h2>References</h2>
      <div class="reference">
        <div class="reference" id="bib1">
          <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
        </div>
      </div>
    </div>
  </body>
</html>
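The step from Tahi XML to Tahi HTML (expanding included figures and turning refs into links) can be sketched as follows. This is an illustrative Python sketch, not the actual generator: the function names are made up, only the main-text section is produced, and the exact caption markup is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML example with populated resources.
TAHI_XML = """
<article>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1">
      <label>Figure 1.</label>
      <caption><em>Annotated</em> Figure caption</caption>
      <image src="fig1.png"/>
    </fig>
  </resources>
</article>
"""


def figure_to_html(fig):
    """Build an HTML <figure> element from a Tahi <fig> resource."""
    figure = ET.Element("figure", {"id": fig.get("id")})
    ET.SubElement(figure, "img", {"src": fig.find("image").get("src")})
    figcaption = ET.SubElement(figure, "figcaption")
    label = ET.SubElement(figcaption, "span", {"class": "label"})
    label.text = fig.findtext("label")
    caption = fig.find("caption")
    # Keep the annotated caption content (text and child elements).
    label.tail = " " + (caption.text or "").lstrip()
    figcaption.extend(list(caption))
    return figure


def body_to_html(xml_text):
    """Render the <body> as the main-text section of Tahi HTML."""
    root = ET.fromstring(xml_text)
    resources = {el.get("id"): el for el in root.find("resources")}
    main = ET.Element("div", {"class": "section main-text"})
    for node in root.find("body"):
        if node.tag == "include" and node.get("type") == "fig":
            # Expand the include by looking up the resource by id.
            html_node = figure_to_html(resources[node.get("rid")])
            html_node.tail = node.tail
            main.append(html_node)
        else:
            main.append(node)
    # Turn <ref rid="..."> into in-document links.
    for ref in list(main.iter("ref")):
        rid = ref.get("rid")
        ref.attrib.clear()
        ref.tag = "a"
        ref.set("href", "#" + rid)
    return ET.tostring(main, encoding="unicode", method="html")


print(body_to_html(TAHI_XML))
```

Because every content element carries an id, the generated HTML can reuse the same ids, which keeps the mapping between the two formats trivial.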
<header> and <main> could be a good idea; however, I would not use the <figure> element, because our key-value-esque markup inside it would violate its validity.
Here's a slightly different variation. I removed the <head> part and united all meta properties inside the <header> element. Still, for me, modelling data in HTML just does not feel right.
That way all properties can be manipulated in the same way. E.g. when a program wants to update the updated-at property using the DOM API, it would work like this:
Saying that, I just realized some more problems with server-side processing of HTML:
Given that the Tahi server is built in Ruby, wouldn't it be easier to manipulate XML using Nokogiri or XSLT workflows?
Focusing only on the source format (data), can we re-verify our arguments for choosing HTML?
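To make the comparison concrete, here is a minimal sketch of the updated-at update mentioned above, done against the XML format. It uses Python's standard library purely for illustration (on the Ruby server, Nokogiri would play this role); the function name `touch_updated_at` and the trimmed sample are assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Trimmed copy of the <meta> block from the XML example.
ARTICLE = """
<article>
  <meta>
    <title>The Tahi Article Format</title>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
</article>
"""


def touch_updated_at(xml_text, now=None):
    """Return the document with <updated-at> set to `now` (default: current UTC time)."""
    root = ET.fromstring(xml_text)
    node = root.find("meta/updated-at")
    if node is None:
        raise ValueError("article has no <meta>/<updated-at> element")
    # Normalize to UTC so the trailing "Z" is accurate; a naive datetime
    # is interpreted as local time by astimezone().
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Same millisecond-precision format as the timestamps in the example.
    node.text = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return ET.tostring(root, encoding="unicode")


print(touch_updated_at(ARTICLE))
```

Every meta property is a plain child element of meta, so the same find-and-set pattern applies to all of them.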
HTML:
XML:
I'm sorry for insisting so long on considering XML, but the decision on the data format will have a huge impact on the project in the future. I just don't want us to make a mistake here.
Here's how the pure-data HTML format renders in the browser: