tahi-spec.md

The Tahi Article Format

Note: The following spec is considered a draft and open for discussion.

Tahi Source XML

Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can use existing toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).

Goals:

Pure data representation
An interface and intermediate format for import/export toolchains (a target format for iHat)
Optimized for automated processing
More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links, etc.)
A unique id for every content element (allows mapping between different data formats and makes creating toolchains much easier)
Use HTML-compatible markup whenever possible (e.g. for style annotations)
Easy to convert from and to HTML (e.g. when pasting from clipboard)
Ability to generate NLM/JATS
External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)

Example:

<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
  resources element is empty in the source file and gets populated on the fly (e.g. each
  time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>
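
The unique-id goal and the by-id inclusion of resources can be checked mechanically. The following is a minimal sketch in Python using only the standard library. The helper names `index_by_id` and `unresolved_refs` are hypothetical and not part of any Tahi toolchain, and the sample is a trimmed version of the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the example article (assumption: same element names).
SOURCE = """<article>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1"><label>Figure 1.</label></fig>
    <bib id="bib1"><year>2010</year></bib>
  </resources>
</article>"""


def index_by_id(root):
    """Map every id attribute to its element; raise on duplicate ids."""
    index = {}
    for el in root.iter():
        el_id = el.get("id")
        if el_id is None:
            continue
        if el_id in index:
            raise ValueError(f"duplicate id: {el_id}")
        index[el_id] = el
    return index


def unresolved_refs(root, index):
    """Return all rid values that do not point at an indexed element."""
    rids = [el.get("rid") for el in root.iter() if el.get("rid") is not None]
    return [rid for rid in rids if rid not in index]


root = ET.fromstring(SOURCE)
index = index_by_id(root)
print(sorted(index))
print(unresolved_refs(root, index))
```

A check like this could run whenever the resources element is populated from the DB, so that dangling `rid` values are caught before any export step.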

Tahi HTML

The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).

Goals:

Tahi HTML is a generated view on the content rather than an editable source format
It conforms to a Tahi HTML spec (WIP), so tools can rely on predictable markup
A different citation style results in a different HTML presentation
Considered a portable and CSS-styleable representation of a Tahi article

Example:

<html>
<head>
  <title>The Tahi Article Format</title>
</head>
<body>
  <div class="section front-matter">
    <h1 class="title">The Tahi Article Format</h1>
    <div class="authors">
      <div class="author">John Doe</div>
    </div>
    <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
    <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
    <!-- Expanded figure -->
    <figure id="fig1">
      <img src="fig1.png">
      <figcaption><span class="label">Figure 1.</span> <em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div class="section references">
    <h2>References</h2>
    <div class="reference">
      <div class="reference" id="bib1">
        <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
      </div>
    </div>
  </div>
</body>
</html>
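
The step from Tahi Source XML to Tahi HTML for an included figure can be sketched as follows. This is an illustrative Python sketch with the standard library, not the actual Tahi generator. The function names `figure_to_html` and `expand_includes` are hypothetical, and only `<include type="fig">` elements that are direct children of the body are handled.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed sample (assumption: same element names as the Tahi Source XML example).
SOURCE = """<article>
  <body>
    <p id="p1">Some content.</p>
    <include type="fig" rid="fig1"/>
  </body>
  <resources>
    <fig id="fig1">
      <label>Figure 1.</label>
      <caption><em>Annotated</em> Figure caption</caption>
      <image src="fig1.png"/>
    </fig>
  </resources>
</article>"""


def figure_to_html(fig):
    """Project a Tahi <fig> resource onto an HTML <figure> element."""
    figure = ET.Element("figure", {"id": fig.get("id")})
    ET.SubElement(figure, "img", {"src": fig.find("image").get("src")})
    figcaption = ET.SubElement(figure, "figcaption")
    label = ET.SubElement(figcaption, "span", {"class": "label"})
    label.text = fig.findtext("label")
    caption = fig.find("caption")
    # The caption text follows the label; annotations (<em>, <strong>) are copied as-is.
    label.tail = " " + (caption.text or "")
    for child in caption:
        figcaption.append(copy.deepcopy(child))
    return figure


def expand_includes(body, resources):
    """Replace top-level <include type="fig"> elements in body with HTML figures."""
    by_id = {res.get("id"): res for res in resources}
    for i, child in enumerate(list(body)):
        if child.tag == "include" and child.get("type") == "fig":
            html_fig = figure_to_html(by_id[child.get("rid")])
            html_fig.tail = child.tail  # keep the surrounding whitespace
            body[i] = html_fig


article = ET.fromstring(SOURCE)
body = article.find("body")
expand_includes(body, article.find("resources"))
print(ET.tostring(body, encoding="unicode", method="html"))
```

Serializing with `method="html"` writes `<img>` as a void element without a closing tag.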

I also thought of using JATS and being stricter than the spec requires. But I see disadvantages in its hierarchical nature: we cannot use it as a random-access model (see the doc.set API examples below), which would be really handy and expressive.

I could see JATS as an option, though we would need query selectors and some extra markup for data manipulation in many cases. Much better than the HTML option, but we could get more.

I think we should not be afraid of defining a minimal data format tailored for Tahi purposes, so that it is as easy as possible to work with from all Tahi subsystems. Then we write generators for JATS, HTML, ePub, etc., which are the formats we communicate to the public. It's REALLY trivial to implement those; it's just a projection of data to markup.
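
To make the "projection of data to markup" idea concrete, here is a small Python sketch that maps one Tahi paragraph onto JATS-style inline markup. The tag and attribute mappings (`em` to `italic`, `strong` to `bold`, `ref type="bib"` to `xref ref-type="bibr"`) are my assumptions about how such a generator might look, and the function name `to_jats` is hypothetical. Headings are left out, since JATS nests them inside `sec` elements and that needs more than a one-to-one mapping.

```python
import xml.etree.ElementTree as ET

# Assumed mappings from Tahi inline markup to JATS element names.
INLINE_TAGS = {"em": "italic", "strong": "bold"}
REF_TYPES = {"bib": "bibr", "fig": "fig"}


def to_jats(el):
    """Return a JATS-flavoured copy of a Tahi element, converting children recursively."""
    if el.tag == "ref":
        attrib = {"ref-type": REF_TYPES[el.get("type")], "rid": el.get("rid")}
        out = ET.Element("xref", attrib)
    else:
        attrib = {"id": el.get("id")} if el.get("id") else {}
        out = ET.Element(INLINE_TAGS.get(el.tag, el.tag), attrib)
    out.text, out.tail = el.text, el.tail
    for child in el:
        out.append(to_jats(child))
    return out


p = ET.fromstring(
    '<p id="p1">Some <em>annotated</em> content '
    '(<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>'
)
print(ET.tostring(to_jats(p), encoding="unicode"))
```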

As an example, Pandoc uses exactly this strategy to deal with its many input and output formats. There's an internal data format, the Pandoc AST, that is optimized to be as useful as possible for Pandoc's reader and writer implementations. It's not communicated as an exchange format. We could do the same with a Tahi Source XML (see very first draft). It would be tailored to the nature of our data and thus make writing Tahi components much easier, since we would not have to deal with redundancy and hierarchy. We could then treat Tahi Articles as little databases that we can query and update without thinking about serialization.

Set paragraph content: doc.set("paragraph_4.content", "hello <strong>world</strong>")
Set header level: doc.set("heading_10.level", 2)
Delete paragraph: doc.delete("paragraph_9")
Create a figure: doc.create({id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png"})

And with each change we make, we can immediately output the corresponding JATS, HTML, etc.
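
The random-access model behind the calls above can be sketched as a flat, id-keyed store. The following Python class is only an illustration of the idea; the class name `Document`, its internals, and the added `get` method are assumptions, and the real doc API may differ.

```python
class Document:
    """Hypothetical flat article store: every node is addressed by its id."""

    def __init__(self):
        self.nodes = {}

    def create(self, node):
        """Add a node given as a dict with at least an 'id' and a 'type'."""
        node_id = node["id"]
        if node_id in self.nodes:
            raise ValueError(f"duplicate id: {node_id}")
        self.nodes[node_id] = dict(node)
        return node_id

    def set(self, path, value):
        """Set a property using the '<node id>.<property>' path notation."""
        node_id, prop = path.split(".", 1)
        self.nodes[node_id][prop] = value

    def get(self, path):
        """Read a property using the same path notation."""
        node_id, prop = path.split(".", 1)
        return self.nodes[node_id][prop]

    def delete(self, node_id):
        """Remove a node by id."""
        del self.nodes[node_id]


doc = Document()
doc.create({"id": "paragraph_4", "type": "paragraph", "content": ""})
doc.create({"id": "paragraph_9", "type": "paragraph", "content": "obsolete"})
doc.create({"id": "heading_10", "type": "heading", "level": 1, "content": "A heading"})

doc.set("paragraph_4.content", "hello <strong>world</strong>")
doc.set("heading_10.level", 2)
doc.delete("paragraph_9")
doc.create({"id": "fig_10", "type": "fig", "caption": "my <em>caption</em>", "url": "fig_14.png"})

print(sorted(doc.nodes))
```

Because no node is nested inside another, an update never has to locate a parent first. Document order would have to be stored separately (for example as a list of ids), which this sketch leaves out.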

michael commented May 21, 2015

mfenner commented May 21, 2015

jure commented May 26, 2015