@michael
Last active January 29, 2017 21:43

The Tahi Article Format

Note: The following spec is considered a draft and open for discussion.

Tahi Source XML

Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can utilize toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).

Goals:

  • Pure data representation
  • An interface and intermediate format for import/export toolchains (a target format for iHat)
  • Optimized for automated processing
  • More compact than HTML (we don't need any presentation specific markup such as wrapping divs, clickable links etc.)
  • Ensure unique id for every content element (mapping between different data formats - makes creating toolchains much easier)
  • Use HTML-compatible markup whenever possible (e.g. for style annotations)
  • Easy to convert from and to HTML (e.g. when pasting from clipboard)
  • Ability to generate NLM/JATS
  • External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)

Example:

<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
  resources element is empty in the source file and gets populated on the fly (e.g. each
  time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>

Tahi HTML

The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted for the most minimal and clean markup. The HTML format is considered a higher level interface for presentation-centric toolchains (e.g. printable output using CSS regions).

Goals:

  • Tahi HTML is a generated view on the content rather than an editable source format
  • It corresponds to a Tahi HTML spec (WIP) so tools can rely on a certain markup
  • A different citation style results in a different HTML presentation
  • Considered a portable and CSS-styleable representation of a Tahi article

Example:

<html>
<head>
  <title>The Tahi Article Format</title>
</head>
<body>
  <div class="section front-matter">
    <h1 class="title">The Tahi Article Format</h1>
    <div class="authors">
      <div class="author">John Doe</div>
    </div>
    <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
    <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
    <!-- Expanded figure -->
    <figure id="fig1">
      <img src="fig1.png">  
      <figcaption><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div class="section references">
    <h2>References</h2>
    <div class="references">
      <div class="reference" id="bib1">
        <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
      </div>
    </div>
  </div>
</body>
</html>
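
The goal that a different citation style yields a different HTML presentation can be sketched as a lookup of formatter functions over the same bib data. This is only an illustration: the style names (`numbered`, `authorYear`), the `renderReference` helper, and the shape of the `bib1` object are assumptions and not part of any Tahi spec.

```javascript
// Hypothetical bib record, mirroring the <bib id="bib1"> resource above.
const bib1 = {
  id: "bib1",
  authors: [{ surname: "Doe", givenNames: "John" }],
  year: 2010,
  title: "Article <em>X</em>",
  source: "Journal Y",
  volume: "1",
  fpage: "40",
  lpage: "45",
  doi: "10.1234/myJournal.00005"
};

// Two illustrative styles; the names are made up for this sketch.
const citationStyles = {
  // Roughly the layout of the reference list entry in the example above.
  numbered: (bib, index) => {
    const authors = bib.authors
      .map((a) => `${a.surname} ${a.givenNames}`)
      .join(", ");
    return `${index}. ${authors} (${bib.year}) ${bib.title}. ` +
      `${bib.source} ${bib.volume}: ${bib.fpage}-${bib.lpage}.`;
  },
  // A compact alternative that only shows first author and year.
  authorYear: (bib) => `${bib.authors[0].surname}, ${bib.year}`
};

// Project one bib record to HTML using the chosen style.
function renderReference(bib, index, styleName) {
  const style = citationStyles[styleName];
  if (!style) throw new Error(`Unknown citation style: ${styleName}`);
  return `<div class="reference" id="${bib.id}">${style(bib, index)}</div>`;
}
```

Switching the style name changes the generated markup while the stored bib data stays untouched.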

esetera commented May 14, 2015

I'm quite happy with that HTML representation as the data format.


mikem commented May 14, 2015

I think we can lean on HTML5 elements for this representation to make it look more natural. For example:

  • <div data-type="meta"> -> <header>
  • <div data-type="body"> -> <main>
  • <div data-type="fig" id="fig1"> -> <figure id="fig1"> (and include a <figcaption> tag)

Why not include the resources inline, where they appear in the document?

Why not use regular <a> tags for references?


michael commented May 15, 2015

<header> and <main> could be a good idea; however, I would not use the <figure> element because our key-value-esque markup inside would violate its validity.

Here's a slightly different variation. I removed the <head> part and united all meta properties inside the <header> element. Still, for me, modelling data in HTML just does not feel right.

<html>
  <body>
    <header id="meta">
      <div data-type="title" id="title">The Tahi Article Format</div>  
      <div data-type="abstract" id="abstract">Article abstract that can be <strong>annotated</strong></div>
      <div data-type="doi">10.1234/myJournal.00001</div>
      <div data-type="status">in-review</div>
      <div data-type="creator">userx</div>
      <div data-type="created-at">2015-05-12T08:20:19.657Z</div>
      <div data-type="updated-at">2015-05-13T04:18:42.657Z</div>
    </header>
    <main id="content">
      <h2 id="h1">A heading</h2>
      <p id="p1">
        Some <em>annotated</em> content (<span data-type="ref" data-ref-type="bib" data-rid="bib1">Doe, 2010</span>).
      </p>
      <!-- external content is included by referencing the external id -->
      <div data-type="include" data-include-type="fig" data-rid="fig1"></div>
    </main>
    <!-- Resources have external ownership and are extracted from the DB. That means the
    resources element is empty in the source file and gets populated on the fly (e.g. each
    time when a doc is opened in the editor). That way we avoid storing content redundantly -->
    <footer id="resources">
      <div data-type="fig" id="fig1">
        <div data-type="doi">10.1234/myJournal.00001.001</div>
        <div data-type="label">Figure 1.</div>
        <div data-type="title">Figure title</div>
        <div data-type="caption">
          <em>Annotated</em> Figure caption
        </div>
        <!-- I did not choose the img tag here because the source format should only have an image
        identifier, not the fully qualified url, so we can change where images are hosted without touching the source files -->
        <div data-type="image" data-image-name="fig1_image_a"></div>
      </div>
      <div data-type="bib" id="bib1">
        <div data-type="contributors">
          <div data-type="author" id="author1">
            <div data-type="surname">Doe</div>
            <div data-type="given-names">John</div>
          </div>
        </div>
        <div data-type="year">2010</div>
        <div data-type="title">Article <em>X</em></div>
        <div data-type="source">Journal Y</div>
        <div data-type="volume">1</div>
        <div data-type="fpage">40</div>
        <div data-type="lpage">45</div>
        <div data-type="doi">10.1234/myJournal.00005</div>
      </div>
    </footer>
  </body>
</html>

That way all properties can be manipulated in the same way. E.g. when a program wants to update the updated-at property using the DOM API, it would work like this:

var updatedAtEl = document.querySelector("#meta div[data-type=updated-at]");
updatedAtEl.textContent = "2015-05-15T16:34:27.373Z";

Having said that, I just realized some more problems with server-side processing of HTML:

  • The code above only works well in browser environments
  • On the server you need to rely on implementations such as JSDOM to use the browser's DOM API

Given that the Tahi server is built in Ruby, wouldn't it be easier to manipulate XML using Nokogiri or XSLT workflows?

Focusing only on the source format (data), can we re-verify our arguments for choosing HTML?

HTML

Definition: HyperText Markup Language, commonly referred to as HTML, is the standard markup language used to create web pages.

  • can be viewed in the browser (see screenshot below)
  • ... what more?

XML

Definition: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable.

  • less markup
  • easy validation using DTD or XML Schema
  • much better tool support for automated processing (outside of the browser)
  • designed to model domain-specific data structures

I'm sorry for insisting so long on considering XML.. but the decision on the data format will have a huge impact on the project in the future... just don't want us to make a mistake here.

Here's how the pure-data HTML format renders in the browser:


mikem commented May 18, 2015

Processing HTML on the server side with Nokogiri shouldn't be any more difficult than XML.

I'm still curious, though: why not include resources inline?


michael commented May 19, 2015

They cannot be inline because they are not owned by the source file (they live in the DB). Instead they are included by id, so they can be dynamically expanded when we serve the presentation HTML.

Actually the pure source format does not have the resources at all.. I just modelled them how they would look when generated.
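
To illustrate the include mechanism, here is a minimal sketch of how include nodes could be expanded on the fly. The node shapes, the `expandIncludes` name, and the `Map` standing in for the DB lookup are all assumptions made for this example.

```javascript
// Body as stored in the source file: the figure is only referenced by id.
const body = [
  { type: "h2", id: "h1", content: "A heading" },
  { type: "p", id: "p1", content: "Some <em>annotated</em> content." },
  { type: "include", includeType: "fig", rid: "fig1" }
];

// Stand-in for the DB tables that own figures and references.
const resources = new Map([
  ["fig1", {
    type: "fig",
    id: "fig1",
    label: "Figure 1.",
    caption: "<em>Annotated</em> Figure caption",
    image: "fig1.png"
  }]
]);

// Return a new node list in which every include is replaced by the
// resource it points to. The stored body is left unmodified.
function expandIncludes(nodes, resourceLookup) {
  return nodes.map((node) => {
    if (node.type !== "include") return node;
    const resource = resourceLookup.get(node.rid);
    if (!resource) throw new Error(`Unresolved include: ${node.rid}`);
    return resource;
  });
}

const expanded = expandIncludes(body, resources);
```

Because the expansion produces a new list, the source keeps only the reference and nothing is stored redundantly.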


michael commented May 20, 2015

As discussed in our IPM this week, here's the Tahi HTML with semantic annotations that combines data and presentation in one file.

<html>
<head>
  <!-- redundancy: same as h1.title, but without annotations -->
  <title data-type="title">The Tahi Article Format</title>
</head>
<body>
  <div data-type="meta" class="section front-matter">
    <h1 class="title" data-type="title">The <em>Tahi</em> Article Format</h1>
    <div class="authors" data-type="authors">
      <div data-type="author" class="author">John Doe</div>
    </div>
    <!-- redundancy: Plain DOI vs dx.doi.org url prefix -->
    <div data-type="doi" class="doi">
      <a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a>
    </div>
    <div data-type="abstract" class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div data-type="main-text" class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1" data-type="ref" data-ref-type="bib-ref" data-ref-target="bib1">Doe, 2010</a>).</p>
    <!-- problem: figures are owned by the db and cannot be modified through the html -->
    <figure data-type="fig" id="fig1">
      <img data-type="url" src="fig1.png"/>
      <figcaption data-type="caption"><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div data-type="references" class="section references">
    <h2>References</h2>
    <!-- problem: references are owned by the db and cannot be modified through the html -->
    <div class="references">
      <div data-type="bib" class="reference" id="bib1">
        <!-- problem: doi is problematic to update, as the dx.doi.org url must also be adjusted accordingly -->
        <span class="label">1.</span> <span data-type="surname">Doe</span> <span data-type="given-names">John</span> (<span data-type="year">2010</span>) <span class="title">Article <em>X</em></span>. <span data-type="source">Journal Y</span> <span data-type="volume">1</span>: <span data-type="fpage">40</span>-<span data-type="lpage">45</span>. Available: <span data-type="doi" class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a></span>.
      </div>
    </div>
  </div>
</body>
</html>

My thoughts:

  • updating this structure is very hard because of redundancy
  • e.g. when we want to update the article title we need to:
    • update h1.title with the annotated title
    • derive a plain-text version of the title
    • store this plain-text version on the title tag, which is in the head
  • the article DOI is also hard to change because of redundancy: the plain DOI and the dx.doi.org URL must be updated together
  • the file is hard to read: the presentation part is polluted with semantic tags, while the data part is polluted with style tags

For illustration:

If we have a pure data format, we can manipulate docs in an object key-value fashion like this:

doc.set("meta.title", "some <em>annotated title</em>");
doc.set("bib1.doi", "10.1234/myJournal.00001");
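
A minimal sketch of what could sit behind such a `doc.set` call, assuming a flat store of nodes keyed by id and paths of the form `<id>.<property>`. The `Doc` class, its `get` method, and the path convention are assumptions for illustration, not an existing Tahi API.

```javascript
// Hypothetical flat store: every node is addressed by its id.
class Doc {
  constructor(nodes) {
    this.nodes = new Map(Object.entries(nodes));
  }

  // Split "bib1.doi" into the node id ("bib1") and the property ("doi").
  static parsePath(path) {
    const dot = path.indexOf(".");
    if (dot < 1 || dot === path.length - 1) {
      throw new Error(`Expected "<id>.<property>", got "${path}"`);
    }
    return [path.slice(0, dot), path.slice(dot + 1)];
  }

  get(path) {
    const [id, property] = Doc.parsePath(path);
    const node = this.nodes.get(id);
    return node ? node[property] : undefined;
  }

  set(path, value) {
    const [id, property] = Doc.parsePath(path);
    const node = this.nodes.get(id);
    if (!node) throw new Error(`Unknown node id: ${id}`);
    node[property] = value;
  }
}

const doc = new Doc({
  meta: { title: "The Tahi Article Format" },
  bib1: { doi: "10.1234/myJournal.00005" }
});

doc.set("meta.title", "some <em>annotated title</em>");
doc.set("bib1.doi", "10.1234/myJournal.00001");
```

Each value is stored exactly once, so an update touches a single property and no derived copies.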

If we use hierarchical HTML (as specified) we always have to query the DOM for the right elements, and deal with redundancy.

var titleEl = document.querySelector("*[data-type=meta] *[data-type=title]");
titleEl.innerHTML = "some <em>annotated title</em>";
// set the title and use the plain text content
document.title = titleEl.textContent;

I would suggest a pragmatic approach:

  • we could go with a pure data format (XML) internally
  • keep this a secret (we don't need to communicate this as an interface)
  • what people will see is only the HTML, when they use the editor they see the HTML change
  • it's always possible to go to any HTML spec we want in the future, if we have the pure data format, but in the other direction we are trapped in the specific serialization we've chosen


mfenner commented May 20, 2015

My two cents:

  • if we go with XML as the source format, I would consider JATS. Not perfect for our scenario, but good enough, and with a large number of tools available.
  • for the HTML representation I worry that we move too much into custom HTML that might be difficult to maintain in the long run. Do we want to write a spec and validators for this HTML? To be more mainstream, I would for example consider using DublinCore meta tags for things like author or title.


michael commented May 21, 2015

I also thought of JATS, applied more strictly than the spec requires. But I see disadvantages in its hierarchical nature: we cannot use it as a random-access model (see the doc.set API shown above), which would be really handy and expressive.

I could see JATS as an option, though we would need query selectors and some extra markup for data manipulation in many cases. Much better than the HTML option, but we could get more.

I think we should not be afraid of defining a minimal data format tailored for Tahi purposes, so it's most easy to work with from all Tahi subsystems. Then we write generators for JATS, HTML, ePub etc. which we communicate to the public. It's REALLY trivial to implement those, it's just a projection of data to markup.

As an example, Pandoc uses exactly that strategy to deal with all those different input and output formats. There's an internal data format, the Pandoc AST, that is optimized to be most useful for implementing Pandoc readers and writers. It's not communicated as an exchange format. We could do the same with a Tahi Source XML (see the very first draft). It would be tailored to the nature of our data and thus make writing Tahi components much easier, since we don't have to deal with redundancy and hierarchy. We can then treat Tahi articles like little databases that we can query and update without thinking about the serialization.

  • Set paragraph content: doc.set("paragraph_4.content", "hello <strong>world</strong>")
  • Set header level: doc.set("heading_10.level", 2)
  • Delete paragraph: doc.delete("paragraph_9")
  • Create a figure: doc.create({id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png"})

And with each change we make, we immediately can output the corresponding JATS, HTML, etc.
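
As a rough sketch of that idea, the store below supports create, set and delete and projects the same data to two outputs. The `ArticleStore` class, its method names, and the paragraph-only scope are assumptions; the JATS side only maps `<em>` to `<italic>` and `<strong>` to `<bold>` and is nowhere near a complete JATS writer.

```javascript
// Hypothetical in-memory article: nodes by id plus a document order.
class ArticleStore {
  constructor() {
    this.nodes = new Map();
    this.order = [];
  }

  create(node) {
    if (this.nodes.has(node.id)) throw new Error(`Duplicate id: ${node.id}`);
    this.nodes.set(node.id, { ...node });
    this.order.push(node.id);
  }

  // Path convention assumed here: "<id>.<property>".
  set(path, value) {
    const dot = path.indexOf(".");
    const node = this.nodes.get(path.slice(0, dot));
    if (dot < 1 || !node) throw new Error(`Cannot resolve path: ${path}`);
    node[path.slice(dot + 1)] = value;
  }

  delete(id) {
    if (!this.nodes.delete(id)) throw new Error(`Unknown node id: ${id}`);
    this.order = this.order.filter((other) => other !== id);
  }

  // Projection 1: HTML keeps the inline markup as stored.
  toHTML() {
    return this.order
      .map((id) => this.nodes.get(id))
      .map((n) => `<p id="${n.id}">${n.content}</p>`)
      .join("\n");
  }

  // Projection 2: JATS-like output with renamed inline elements.
  toJATS() {
    return this.order
      .map((id) => this.nodes.get(id))
      .map((n) => {
        const content = n.content
          .replace(/<(\/?)em>/g, "<$1italic>")
          .replace(/<(\/?)strong>/g, "<$1bold>");
        return `<p id="${n.id}">${content}</p>`;
      })
      .join("\n");
  }
}

const store = new ArticleStore();
store.create({ id: "paragraph_4", type: "paragraph", content: "hello" });
store.set("paragraph_4.content", "hello <strong>world</strong>");
```

Calling `store.toHTML()` or `store.toJATS()` after any change regenerates the respective output from the single stored copy.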


mfenner commented May 21, 2015

@michael: this makes sense to me. And I am a big fan of the Pandoc AST :). But it means that there is an internal Tahi format different from the HTML output, which is conceptually different from using a single HTML format all the way through.


jure commented May 26, 2015

Note to self, clicking on the star button in GitHub before submitting your comment will result in total loss of content. So here I go again.

I can see the benefits of two formats, one for source and one for presentation. But if the redundancy in the one-format HTML scenario is limited to these few situations (the title, citations, and authors), that's likely less redundant than using two formats for what can be achieved in one. And that's ignoring the religiousness of Tahi's HTML promise, which should not be easily ignored.

To me that HTML example that you provided looks OK, and if the issue is that every component will have to address these redundancies, perhaps that could be addressed with a single point of contact with the HTML, an HTML writer module, which deals with redundancies in one place.
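
A rough sketch of such a writer module, assuming it owns the redundant fields and derives them from one input each. The names `createHtmlWriter` and `toPlainText` are hypothetical, and the regex-based tag stripping is a shortcut that a real module would replace with an HTML parser.

```javascript
// Strip inline tags to get a plain-text title. A regex is enough for this
// sketch; it is not a general-purpose HTML sanitizer.
function toPlainText(html) {
  return html.replace(/<[^>]*>/g, "");
}

// Single point of contact: callers pass one value, the writer keeps the
// redundant copies (head title, DOI link target) consistent.
function createHtmlWriter() {
  const fields = { headTitle: "", h1Title: "", doiText: "", doiHref: "" };
  return {
    setTitle(annotatedTitle) {
      fields.h1Title = annotatedTitle;
      fields.headTitle = toPlainText(annotatedTitle);
    },
    setDoi(doi) {
      fields.doiText = doi;
      fields.doiHref = `http://dx.doi.org/${doi}`;
    },
    render() {
      return [
        `<title data-type="title">${fields.headTitle}</title>`,
        `<h1 class="title" data-type="title">${fields.h1Title}</h1>`,
        `<div data-type="doi" class="doi"><a href="${fields.doiHref}">${fields.doiText}</a></div>`
      ].join("\n");
    }
  };
}

const writer = createHtmlWriter();
writer.setTitle("The <em>Tahi</em> Article Format");
writer.setDoi("10.1234/myJournal.00001");
```

Components would then call `writer.setTitle(...)` rather than touching the `title` and `h1` elements separately.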

Additionally, I went digging for thoughts on how to store JSON in data attributes, and it's very straightforward. The HTML spec allows for single quoting of attributes, which means you can do something like this:

<div id="awesome-json" data-awesome='{"game":"on"}'></div> 

And then access it very elegantly like so:

var gameStatus = jQuery("#awesome-json").data('awesome').game;
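
One caveat when generating such attributes: the JSON itself may contain an apostrophe or an ampersand, which has to be escaped inside a single-quoted attribute. The helpers below are a hypothetical sketch of the writing side (the function names are made up); a browser undoes the escaping on its own when it parses the attribute.

```javascript
// Escape the two characters that matter in a single-quoted attribute value.
function escapeAttribute(value) {
  return value.replace(/&/g, "&amp;").replace(/'/g, "&#39;");
}

// Reverse of escapeAttribute, for reading the raw markup without a browser.
function unescapeAttribute(value) {
  return value.replace(/&#39;/g, "'").replace(/&amp;/g, "&");
}

// Serialize data as JSON into a single-quoted data-* attribute.
function toDataAttribute(name, data) {
  return `data-${name}='${escapeAttribute(JSON.stringify(data))}'`;
}

const simple = toDataAttribute("awesome", { game: "on" });
const tricky = toDataAttribute("awesome", { note: "it's on & running" });
```

Here `simple` matches the attribute shown above, while `tricky` carries the apostrophe as `&#39;` and the ampersand as `&amp;`.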

In summary, my 2 cents is in favor of a one-format scenario, with source data stored in the HTML using JSON in data attributes. I feel like this could be done very elegantly as well and would be less redundant than two formats. In essence, both solutions can be elegant, but if we also consider, on top of elegance and correctness, that the Tahi HTML promise is an important one, the scale tips in favor of HTML.
