@michael
Last active January 29, 2017 21:43

The Tahi Article Format

Note: The following spec is considered a draft and open for discussion.

Tahi Source XML

Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can utilize toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).

Goals:

  • Pure data representation
  • An interface and intermediate format for import/export toolchains (a target format for iHat)
  • Optimized for automated processing
  • More compact than HTML (we don't need any presentation-specific markup such as wrapping divs, clickable links, etc.)
  • Ensure a unique id for every content element (ids allow mapping between different data formats, which makes creating toolchains much easier)
  • Use HTML-compatible markup whenever possible (e.g. for style annotations)
  • Easy to convert from and to HTML (e.g. when pasting from clipboard)
  • Ability to generate NLM/JATS
  • External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)

Example:

<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
  resources element is empty in the source file and gets populated on the fly (e.g. each
  time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>
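
The comment in the example says that the resources element is populated on the fly from the DB. A minimal sketch of that step is shown here, assuming an in-memory `Map` as a stand-in for the figure and reference tables. The helper names `collectResourceIds` and `resolveResources` are hypothetical, and the regular expression is a shortcut for illustration; a real implementation would more likely use an XML parser.

```javascript
// Hypothetical stand-in for the figure and reference DB tables.
const resourceTable = new Map([
  ["fig1", { id: "fig1", type: "fig", title: "Figure title", src: "fig1.png" }],
  ["fig2", { id: "fig2", type: "fig", title: "Unused figure", src: "fig2.png" }],
  ["bib1", { id: "bib1", type: "bib", year: 2010, source: "Journal Y" }],
]);

// Body as stored in the source file (resources are not stored with it).
const bodyXml =
  '<body><p id="p1">Some content (<ref type="bib" rid="bib1">Doe, 2010</ref>).</p>' +
  '<include type="fig" rid="fig1"/></body>';

// Collect the distinct rid values in document order.
function collectResourceIds(xml) {
  const ids = [];
  const pattern = /\brid="([^"]+)"/g;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    if (!ids.includes(match[1])) ids.push(match[1]);
  }
  return ids;
}

// Look up every referenced resource; fail loudly on a dangling reference.
function resolveResources(xml, table) {
  return collectResourceIds(xml).map((id) => {
    const resource = table.get(id);
    if (!resource) throw new Error("missing resource: " + id);
    return resource;
  });
}

const resources = resolveResources(bodyXml, resourceTable);
```

Only resources that the body actually references are included, so the unused `fig2` entry stays out of the populated document.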

Tahi HTML

The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted to keep the markup minimal and clean. The HTML format is considered a higher-level interface for presentation-centric toolchains (e.g. printable output using CSS regions).

Goals:

  • Tahi HTML is a generated view on the content rather than an editable source format
  • It conforms to a Tahi HTML spec (WIP), so tools can rely on predictable markup
  • A different citation style results in a different HTML presentation
  • Considered a portable and CSS-styleable representation of a Tahi article

Example:

<html>
<head>
  <title>The Tahi Article Format</title>
</head>
<body>
  <div class="section front-matter">
    <h1 class="title">The Tahi Article Format</h1>
    <div class="authors">
      <div class="author">John Doe</div>
    </div>
    <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
    <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
    <!-- Expanded figure -->
    <figure id="fig1">
      <img src="fig1.png">
      <figcaption><span class="label">Figure 1.</span> <em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div class="section references">
    <h2>References</h2>
    <div class="reference">
      <div class="reference" id="bib1">
        <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
      </div>
    </div>
  </div>
</body>
</html>
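
To illustrate the goal that a different citation style results in a different HTML presentation, here is a minimal sketch of rendering the in-text citation link from a bibliography record. The record shape, the function name `renderCitation` and the style names `"author-year"` and `"numeric"` are assumptions made for this example, not part of the draft spec.

```javascript
// Hypothetical record mirroring part of the <bib id="bib1"> element.
const bib1 = {
  id: "bib1",
  authors: [{ surname: "Doe", givenNames: "John" }],
  year: 2010,
};

// Render an in-text citation as a link to the reference entry.
// position is the 1-based position of the entry in the reference list.
function renderCitation(bib, position, style) {
  let label;
  if (style === "numeric") {
    label = "[" + position + "]";
  } else {
    // author-year: first author's surname, "et al." when there are more
    const first = bib.authors[0].surname;
    label = (bib.authors.length > 1 ? first + " et al." : first) + ", " + bib.year;
  }
  return `<a href="#${bib.id}">${label}</a>`;
}

const authorYear = renderCitation(bib1, 1, "author-year");
const numeric = renderCitation(bib1, 1, "numeric");
```

Both calls start from the same source record; only the generated view differs.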

michael commented May 21, 2015

I also thought of using JATS and being stricter than the spec allows. But I see disadvantages in its hierarchical nature: we cannot use it as a random-access model (see the doc.set API below), which would be really handy and expressive.

I could see JATS as an option, though we would need query selectors and some extra markup for data manipulation in many cases. It is much better than the HTML option, but we could do better.

I think we should not be afraid of defining a minimal data format tailored to Tahi's purposes, so that it is as easy as possible to work with from all Tahi subsystems. Then we write generators for JATS, HTML, ePub, etc., and those are the formats we communicate to the public. It's REALLY trivial to implement those: it's just a projection of data to markup.

As an example, Pandoc uses exactly that strategy to deal with all its different input and output formats. There's an internal data format, the Pandoc AST, that is optimized to be most useful for Pandoc reader and writer implementations. It's not communicated as an exchange format. We could do the same with a Tahi Source XML (see very first draft). It would be perfectly tailored to the nature of our data, and thus make writing Tahi components much, much easier, since we don't have to deal with redundancy and hierarchy. We can then consider Tahi articles as little databases that we can query and update without thinking about serialization or dealing with redundancies.

  • Set paragraph content: doc.set("paragraph_4.content", "hello <strong>world</strong>")
  • Set header level: doc.set("heading_10.level", 2)
  • Delete paragraph: doc.delete("paragraph_9")
  • Create a figure: doc.create({id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png"})

And with each change we make, we can immediately output the corresponding JATS, HTML, etc.
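
The operations listed above could be backed by a flat, id-addressable store. The following is only a sketch under assumptions: the class name `ArticleDoc`, the node fields and the `toHTML` projection are hypothetical and are not an existing Tahi API.

```javascript
// Hypothetical sketch of a flat, id-addressable article model.
class ArticleDoc {
  constructor() {
    this.nodes = new Map(); // id -> node; a Map keeps insertion order
  }

  // Add a node; every node needs a unique id and a type.
  create(node) {
    if (!node.id || !node.type) throw new Error("node needs id and type");
    if (this.nodes.has(node.id)) throw new Error("duplicate id: " + node.id);
    this.nodes.set(node.id, Object.assign({}, node));
    return node.id;
  }

  // Set one property, addressed as "<id>.<property>".
  set(path, value) {
    const dot = path.indexOf(".");
    if (dot < 0) throw new Error("path must look like id.property: " + path);
    const node = this.nodes.get(path.slice(0, dot));
    if (!node) throw new Error("unknown node: " + path);
    node[path.slice(dot + 1)] = value;
  }

  // Remove a node by id; returns true if it existed.
  delete(id) {
    return this.nodes.delete(id);
  }

  // One possible projection of the data to markup.
  toHTML() {
    const out = [];
    for (const n of this.nodes.values()) {
      if (n.type === "paragraph") {
        out.push(`<p id="${n.id}">${n.content}</p>`);
      } else if (n.type === "heading") {
        out.push(`<h${n.level} id="${n.id}">${n.content}</h${n.level}>`);
      } else if (n.type === "fig") {
        out.push(`<figure id="${n.id}"><img src="${n.url}"><figcaption>${n.caption}</figcaption></figure>`);
      }
    }
    return out.join("\n");
  }
}

// Usage, mirroring the calls listed above.
const doc = new ArticleDoc();
doc.create({ id: "heading_10", type: "heading", level: 1, content: "A heading" });
doc.create({ id: "paragraph_4", type: "paragraph", content: "" });
doc.create({ id: "paragraph_9", type: "paragraph", content: "to be removed" });
doc.set("paragraph_4.content", "hello <strong>world</strong>");
doc.set("heading_10.level", 2);
doc.delete("paragraph_9");
doc.create({ id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png" });
const html = doc.toHTML();
```

Because nodes are addressed by id rather than by position in a tree, each change touches exactly one record, and a generator such as `toHTML` can be re-run after every change.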


mfenner commented May 21, 2015

@michael: this makes sense to me. And I am a big fan of the Pandoc AST :). But it means that there is an internal Tahi format different from the HTML output, which is conceptually different from using a single HTML format all the way through.


jure commented May 26, 2015

Note to self, clicking on the star button in GitHub before submitting your comment will result in total loss of content. So here I go again.

I can see the benefits of two formats, one for source and one for presentation. But if the redundancy in the one-format HTML scenario is limited to these few cases, i.e. the title, citations, and authors, that's likely less redundant than using two formats for what can be achieved in one. And that's before considering the religiousness of Tahi's HTML promise, which should not be easily ignored.

To me that HTML example that you provided looks OK, and if the issue is that every component will have to address these redundancies, perhaps that could be addressed with a single point of contact with the HTML, an HTML writer module, which deals with redundancies in one place.

Additionally, I went digging for thoughts on how to store JSON in data attributes, and it's very straightforward. The HTML spec allows for single quoting of attributes, which means you can do something like this:

<div id="awesome-json" data-awesome='{"game":"on"}'></div> 

And then access it very elegantly like so:

// jQuery's .data() parses JSON-looking data-* attribute values into objects
var gameStatus = jQuery("#awesome-json").data('awesome').game;
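
One caveat with single-quoted attributes is that an apostrophe inside the JSON would end the attribute early, so the value needs escaping when the markup is written. The sketch below shows one way to do that. The helper names `toDataAttr` and `fromDataAttr` are hypothetical, and `fromDataAttr` only imitates the entity decoding that a browser's HTML parser would already have done before `JSON.parse(element.dataset.awesome)` is called.

```javascript
// Hypothetical helpers; not part of jQuery or Tahi.

// Serialize a value as JSON, escaped for a single-quoted HTML attribute.
function toDataAttr(value) {
  return JSON.stringify(value)
    .replace(/&/g, "&amp;") // escape ampersands first
    .replace(/'/g, "&#39;"); // then the quote character that delimits the attribute
}

// Reverse of toDataAttr for raw markup (a browser decodes entities itself).
function fromDataAttr(raw) {
  return JSON.parse(raw.replace(/&#39;/g, "'").replace(/&amp;/g, "&"));
}

const payload = { game: "on", note: "it's on & running" };
const markup =
  "<div id=\"awesome-json\" data-awesome='" + toDataAttr(payload) + "'></div>";

// Pull the attribute value back out of the markup and parse it.
const rawValue = /data-awesome='([^']*)'/.exec(markup)[1];
const restored = fromDataAttr(rawValue);
```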

In summary, my 2 cents are in favor of a one-format scenario, with source data stored in the HTML using JSON in data attributes. I feel like this could be done very elegantly as well and would be less redundant than two formats. In essence, both solutions can be elegant, but if we also consider, on top of elegance and correctness, that the Tahi HTML promise is an important one, the scale tips in favor of HTML.
