@michael
Last active January 29, 2017 21:43

The Tahi Article Format

Note: The following spec is considered a draft and open for discussion.

Tahi Source XML

Content in Tahi is considered data. That data lives in several places: figures and references are considered external resources that live in separate DB tables. The main content, however, is expressed in XML using a strict schema. By choosing XML, we can utilize toolchains for automated data processing (e.g. XSLT for normalizing content, server-side maintenance scripts, etc.).

Goals:

  • Pure data representation
  • An interface and intermediate format for import/export toolchains (a target format for iHat)
  • Optimized for automated processing
  • More compact than HTML (we don't need any presentation specific markup such as wrapping divs, clickable links etc.)
  • Ensure unique id for every content element (mapping between different data formats - makes creating toolchains much easier)
  • Use HTML-compatible markup whenever possible (e.g. for style annotations)
  • Easy to convert from and to HTML (e.g. when pasting from clipboard)
  • Ability to generate NLM/JATS
  • External resources can be included by id: in portable formats they are expanded (e.g. Tahi HTML wrapped in an ePub container)

Example:

<article>
  <meta>
    <title>The Tahi Article Format</title>
    <abstract>Article abstract that can be <strong>annotated</strong></abstract>
    <doi>10.1234/myJournal.00001</doi>
    <status code="in-review">In review</status>
    <creator>userx</creator>
    <created-at>2015-05-12T08:20:19.657Z</created-at>
    <updated-at>2015-05-13T04:18:42.657Z</updated-at>
  </meta>
  <body>
    <h2 id="h1">A heading</h2>
    <p id="p1">
      Some <em>annotated</em> content (<ref type="bib" rid="bib1">Doe, 2010</ref>).
    </p>
    <!-- external content is included by referencing the external id -->
    <include type="fig" rid="fig1"/>
  </body>
  <!-- Resources have external ownership and are extracted from the DB. That means the
  resources element is empty in the source file and gets populated on the fly (e.g. each
  time when a doc is opened in the editor). That way we avoid storing content redundantly -->
  <resources>
    <fig id="fig1">
      <doi>10.1234/myJournal.00001.001</doi>
      <label>Figure 1.</label>
      <title>Figure title</title>
      <caption>
        <em>Annotated</em> Figure caption
      </caption>
      <image src="fig1.png"/>
    </fig>
    <bib id="bib1" bib-type="journal-article">
      <contributors>
        <author id="author1">
          <surname>Doe</surname>
          <given-names>John</given-names>
        </author>
      </contributors>
      <year>2010</year>
      <title>Article <em>X</em></title>
      <source>Journal Y</source>
      <volume>1</volume>
      <fpage>40</fpage>
      <lpage>45</lpage>
      <doi>10.1234/myJournal.00005</doi>
    </bib>
  </resources>
</article>

Tahi HTML

The Tahi HTML format is a complete and human-readable version of the Tahi Article. Meta-information that is not relevant for display is omitted for the most minimal and clean markup. The HTML format is considered a higher level interface for presentation-centric toolchains (e.g. printable output using CSS regions).

Goals:

  • Tahi HTML is a generated view on the content rather than an editable source format
  • It corresponds to a Tahi HTML spec (WIP) so tools can rely on a certain markup
  • A different citation style results in a different HTML presentation
  • Considered a portable and CSS-styleable representation of a Tahi article

Example:

<html>
<head>
  <title>The Tahi Article Format</title>
</head>
<body>
  <div class="section front-matter">
    <h1 class="title">The Tahi Article Format</h1>
    <div class="authors">
      <div class="author">John Doe</div>
    </div>
    <div class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a></div>
    <div class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1">Doe, 2010</a>).</p>
    <!-- Expanded figure -->
    <figure id="fig1">
      <img src="fig1.png">  
      <figcaption><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div class="section references">
    <h2>References</h2>
    <div class="references">
      <div class="reference" id="bib1">
        <span class="label">1.</span> Doe John (2010) Article X. Journal Y 1: 40-45. Available: <a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a>.
      </div>
    </div>
  </div>
</body>
</html>
@esetera

esetera commented May 13, 2015

Looks clean to me. I'm still not entirely clear, though, why we need (1) when it can also be expressed, without loss, in (2), i.e. why use XML when we can represent exactly the same thing in HTML (as demonstrated)...

The only advantage I see in the list of requirements that points to XML is the first item:
Pure data representation

That 'leads' to XML (supposedly). All the other points are what we can do with the article once it is HTML...so, it begs the question to me - what is the advantage of XML here? Currently I don't see it...especially when there are good tools for HTML validation etc and if we write good code for constructing HTML then it is sure to conform to our needs...adding XML actually adds to the complexity by requiring an XML<->HTML conversion which feels unnecessary and an area which can be potentially error prone.

I do think the structure is more readable in (1) however. Which is perhaps ironic as that is the machine readable version ;) How about just using HTML and exploring ways to make the structure cleaner?

@obuchtala

IMO there is another aspect we need to take into consideration: even with an HTML-only representation, an extra tool will be necessary due to the distributed nature of our data.

At the moment we have the problem that the HTML stored in a paper's body is in fact a partially redundant representation. Collections are stored in different locations and are processed by other tools/cards (for instance, assets could be relocated). Right now it would be necessary to open the editor manually to get the body updated.

No matter what representation we choose, I see three aggregation levels:

  • Source/No Redundancy (persistence level): redundant data is never good to have internally. Instead of having expanded nodes such as figures in the body HTML a referential presentation should be used (as Michael proposed with the <include> tag)
  • Source/Self contained (processing level): fully expanded content so that everything necessary for processing is 'in memory'.
  • Presentation (display level): augmented and interpreted version which contains everything necessary for presentation, such as for creating PDF. (see the semantic twist of moving from a meta tag to the positioned front-matter)
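The step from the persistence level to the processing level can be sketched in a few lines. This is only an illustration under assumptions: the plain-object node shapes, the `resourceStore` lookup table (standing in for the DB tables) and the `expandIncludes` function are hypothetical names, not part of any Tahi code.

```javascript
// Hypothetical in-memory shapes; the thread does not define these.
const source = {
  body: [
    { type: "p", id: "p1", content: "Some <em>annotated</em> content." },
    { type: "include", includeType: "fig", rid: "fig1" },
  ],
};

// Stand-in for the DB tables that own figures and references.
const resourceStore = {
  fig1: { type: "fig", id: "fig1", label: "Figure 1.", image: "fig1.png" },
};

// Persistence level -> processing level: replace each include with the
// resource it points to, leaving the stored source untouched.
function expandIncludes(doc, store) {
  const body = doc.body.map((node) => {
    if (node.type !== "include") return node;
    const resource = store[node.rid];
    if (!resource) throw new Error("Unknown resource id: " + node.rid);
    return resource;
  });
  return { ...doc, body };
}

const selfContained = expandIncludes(source, resourceStore);
console.log(selfContained.body[1].image); // fig1.png
```

Because the expansion returns a new object, the non-redundant source stays the thing that is persisted, and the self-contained version can be thrown away after processing.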

tl;dr

I think it is an illusion that there won't be any extra conversion necessary.

@rizwanreza

I also don't think we gain a lot of advantages by using XML as a source format. I see it as an advantage that HTML 1) can be opened in the browser, 2) is the format underlying ePub (which iHat works with), and 3) isn't proprietary. What we need is a subset of HTML that isn't mixed with divs and unnecessary tags in the source, and we can achieve that using validation scripts.

That said, I do think we may require two representations of Tahi HTML: Source and Presentation. The source format is going to point to the collections, but would not include them necessarily, so as to reduce the redundancy. The presentation format can be completely handled in-memory and doesn't need to be persisted. iHat and Tahi talk to each other in the source Tahi HTML format.

@michael
Author

michael commented May 14, 2015

why we need (1) when it can also be expressed, without loss, in (2), i.e. why use XML when we can represent exactly the same thing in HTML (as demonstrated)...

It actually cannot be expressed in (2) without loss, nor without redundancy. Here are some examples:

  • a formatted reference (as we want it in the Tahi-HTML) does not carry the structured bibliographic information anymore.
  • some internal meta info such as created-at, updated-at, are crucial in the source format, but should be left out in the Tahi HTML
  • in Tahi-HTML we will have two occurrences of the title: One <title> element in the head and an <h1> that is actually rendered.
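The first point can be made concrete with a small sketch. The record below mirrors the `<bib>` element from the source XML example; the `formatReference` function and its single hard-coded citation style are assumptions made for illustration only.

```javascript
// Structured record, mirroring the <bib> element of the source XML example.
const bib1 = {
  authors: [{ surname: "Doe", givenNames: "John" }],
  year: 2010,
  title: "Article X",
  source: "Journal Y",
  volume: 1,
  fpage: 40,
  lpage: 45,
  doi: "10.1234/myJournal.00005",
};

// Hypothetical formatter for one citation style.
function formatReference(bib) {
  const authors = bib.authors
    .map((a) => a.surname + " " + a.givenNames)
    .join(", ");
  return (
    authors + " (" + bib.year + ") " + bib.title + ". " +
    bib.source + " " + bib.volume + ": " + bib.fpage + "-" + bib.lpage +
    ". Available: " + bib.doi + "."
  );
}

console.log(formatReference(bib1));
// Doe John (2010) Article X. Journal Y 1: 40-45. Available: 10.1234/myJournal.00005.
```

Going from the record to the string is a simple projection. Going back would mean guessing where the author list ends and which number is the volume, which is the loss referred to above.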

I do think the structure is more readable in (1) however. Which is perhaps ironic as that is the machine readable version ;) How about just using HTML and exploring ways to make the structure cleaner?

Hehe, I guess you read like a machine then. ;) What matters is that we have a pure data source without redundancy that excludes externally owned pieces. Imo this raw data does not need to be displayed, just as you would not display the contents of a raw database row to the user.

I'm 100% sure that we need to draw a distinction between data and presentation, as Riz and Oliver confirmed. While it's pretty clear to us what the presentation format will look like, we still need to define the source format. For modelling data, XML just came naturally.

For comparison, I've modelled the source format in HTML:

<html>
  <head>
    <!-- no title annotations possible with meta tags so moved into html body -->
    <!-- no abstract annotations possible with meta tags so moved into html body -->
    <meta name="doi" content="10.1234/myJournal.00001"/>
    <meta name="status" content="in-review"/>
    <meta name="creator" content="userx"/>
    <meta name="created-at" content="2015-05-12T08:20:19.657Z"/>
    <meta name="updated-at" content="2015-05-13T04:18:42.657Z"/>
  </head>
  <body>
    <div data-type="meta">
      <div data-type="title" id="title">The Tahi Article Format</div>  
      <div data-type="abstract" id="abstract">Article abstract that can be <strong>annotated</strong></div>
    </div>
    <div data-type="body">
      <h2 id="h1">A heading</h2>
      <p id="p1">
        Some <em>annotated</em> content (<span data-type="ref" data-ref-type="bib" data-rid="bib1">Doe, 2010</span>).
      </p>
      <!-- external content is included by referencing the external id -->
      <div data-type="include" data-include-type="fig" data-rid="fig1"></div>
    </div>
    <!-- Resources have external ownership and are extracted from the DB. That means the
    resources element is empty in the source file and gets populated on the fly (e.g. each
    time when a doc is opened in the editor). That way we avoid storing content redundantly -->
    <div data-type="resources">
      <div data-type="fig" id="fig1">
        <div data-type="doi">10.1234/myJournal.00001.001</div>
        <div data-type="label">Figure 1.</div>
        <div data-type="title">Figure title</div>
        <div data-type="caption">
          <em>Annotated</em> Figure caption
        </div>
        <!-- I did not choose the img tag here because the source format should only have an image
        identifier, not the fully qualified url, so we can change where images are hosted without touching the source files -->
        <div data-type="image" data-image-name="fig1_image_a"></div>
      </div>
      <div data-type="bib" id="bib1">
        <div data-type="contributors">
          <div data-type="author" id="author1">
            <div data-type="surname">Doe</div>
            <div data-type="given-names">John</div>
          </div>
        </div>
        <div data-type="year">2010</div>
        <div data-type="title">Article <em>X</em></div>
        <div data-type="source">Journal Y</div>
        <div data-type="volume">1</div>
        <div data-type="fpage">40</div>
        <div data-type="lpage">45</div>
        <div data-type="doi">10.1234/myJournal.00005</div>
      </div>
    </div>
  </body>
</html>
  • I used HTML meta tags for the different meta information. However, we cannot use annotations there, so I had to move the title and abstract into the HTML body, where we need extra wrappers to designate the meta, body, and resources areas.
  • Also, I want to point out that while this is valid HTML and would display in the browser, looking at the raw data like that does not give you much.
  • For the source format, HTML only complicates things imo.
  • With XML (source) and HTML (presentation) we would clearly communicate data vs. view, while there would be some confusion about HTML (source) vs. HTML (presentation).

@esetera

esetera commented May 14, 2015

I'm quite happy with that HTML representation as the data format.

@mikem

mikem commented May 14, 2015

I think we can lean on HTML5 elements for this representation to make it look more natural. For example:

  • <div data-type="meta"> -> <header>
  • <div data-type="body"> -> <main>
  • <div data-type="fig" id="fig1"> -> <figure id="fig1"> (and include a <figcaption> tag)

Why not include the resources inline, where they appear in the document?

Why not use regular <a> tags for references?

@michael
Author

michael commented May 15, 2015

<header> and <main> could be a good idea. However, I would not use the <figure> element, because our key-value-esque markup inside would violate its validity.

Here's a slightly different variation. I removed the <head> part and united all meta properties inside the <header> element. Still, for me modelling data in HTML just does not feel right.

<html>
  <body>
    <header id="meta">
      <div data-type="title" id="title">The Tahi Article Format</div>  
      <div data-type="abstract" id="abstract">Article abstract that can be <strong>annotated</strong></div>
      <div data-type="doi">10.1234/myJournal.00001</div>
      <div data-type="status">in-review</div>
      <div data-type="creator">userx</div>
      <div data-type="created-at">2015-05-12T08:20:19.657Z</div>
      <div data-type="updated-at">2015-05-13T04:18:42.657Z</div>
    </header>
    <main id="content">
      <h2 id="h1">A heading</h2>
      <p id="p1">
        Some <em>annotated</em> content (<span data-type="ref" data-ref-type="bib" data-rid="bib1">Doe, 2010</span>).
      </p>
      <!-- external content is included by referencing the external id -->
      <div data-type="include" data-include-type="fig" data-rid="fig1"></div>
    </main>
    <!-- Resources have external ownership and are extracted from the DB. That means the
    resources element is empty in the source file and gets populated on the fly (e.g. each
    time when a doc is opened in the editor). That way we avoid storing content redundantly -->
    <footer id="resources">
      <div data-type="fig" id="fig1">
        <div data-type="doi">10.1234/myJournal.00001.001</div>
        <div data-type="label">Figure 1.</div>
        <div data-type="title">Figure title</div>
        <div data-type="caption">
          <em>Annotated</em> Figure caption
        </div>
        <!-- I did not choose the img tag here because the source format should only have an image
        identifier, not the fully qualified url, so we can change where images are hosted without touching the source files -->
        <div data-type="image" data-image-name="fig1_image_a"></div>
      </div>
      <div data-type="bib" id="bib1">
        <div data-type="contributors">
          <div data-type="author" id="author1">
            <div data-type="surname">Doe</div>
            <div data-type="given-names">John</div>
          </div>
        </div>
        <div data-type="year">2010</div>
        <div data-type="title">Article <em>X</em></div>
        <div data-type="source">Journal Y</div>
        <div data-type="volume">1</div>
        <div data-type="fpage">40</div>
        <div data-type="lpage">45</div>
        <div data-type="doi">10.1234/myJournal.00005</div>
      </div>
    </footer>
  </body>
</html>

That way all properties can be manipulated in the same way. E.g. when a program wants to update the updated-at property using the DOM API, it would work like this:

var updatedAtEl = document.querySelector("#meta div[data-type=updated-at]");
updatedAtEl.textContent = "2015-05-15T16:34:27.373Z";

Having said that, I just realized some more problems with server-side processing of HTML:

  • The code above only works well in browser environments
  • On the server you need to rely on implementations such as JSDOM to use the browser's DOM API

Given that the Tahi server is built in Ruby, wouldn't it be easier to manipulate XML using Nokogiri or XSLT workflows?

Focussing only on the source format (data), can we re-verify our arguments for choosing HTML?

HTML

Definition: HyperText Markup Language, commonly referred to as HTML, is the standard markup language used to create web pages.

  • can be viewed in the browser (see screenshot below)
  • ... what more?

XML:

Definition: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format which is both human-readable and machine-readable

  • less markup
  • easy validation using DTD or XML Schema
  • much better tool support for automated processing (outside of the browser)
  • designed to model domain-specific data structures

I'm sorry for insisting so long on considering XML.. but the decision on the data format will have a huge impact on the project in the future... just don't want us to make a mistake here.

Here's how the pure-data HTML format renders in the browser:

@mikem
Copy link

mikem commented May 18, 2015

Processing HTML on the server side with Nokogiri shouldn't be any more difficult than XML.

I'm still curious, though: why not include resources inline?

@michael
Author

michael commented May 19, 2015

They cannot be inline because they are not owned by the source file (they live in the DB). Instead they are included by reference, so they can be dynamically expanded when we serve the presentation HTML.

Actually the pure source format does not have the resources at all. I just modelled them as they would look when generated.

@michael
Author

michael commented May 20, 2015

As discussed in our IPM this week, here's the Tahi HTML with semantic annotations that combines data and presentation in one file.

<html>
<head>
  <!-- redundancy: same as h1.title, but without annotations -->
  <title data-type="title">The Tahi Article Format</title>
</head>
<body>
  <div data-type="meta" class="section front-matter">
    <h1 class="title" data-type="title">The <em>Tahi</em> Article Format</h1>
    <div class="authors" data-type="authors">
      <div data-type="author" class="author">John Doe</div>
    </div>
    <!-- redundancy: Plain DOI vs dx.doi.org url prefix -->
    <div data-type="doi" class="doi">
      <a href="http://dx.doi.org/10.1234/myJournal.00001">10.1234/myJournal.00001</a>
    </div>
    <div data-type="abstract" class="abstract">Article abstract that can be <strong>annotated</strong></div>
  </div>
  <div data-type="main-text" class="section main-text">
    <h2 id="h1">A heading</h2>
    <p id="p1">Some <em>annotated</em> content (<a href="#bib1" data-type="ref" data-ref-type="bib-ref" data-ref-target="bib1">Doe, 2010</a>).</p>
    <!-- problem: figures are owned by the db and cannot be modified through the html -->
    <figure data-type="fig" id="fig1">
      <img data-type="url" src="fig1.png"/>
      <figcaption data-type="caption"><span class="label">Figure 1</span><em>Annotated</em> Figure caption</figcaption>
    </figure>
  </div>
  <div data-type="references" class="section references">
    <h2>References</h2>
    <!-- problem: references are owned by the db and cannot be modified through the html -->
    <div class="references">
      <div data-type="bib" class="reference" id="bib1">
        <!-- problem: doi is problematic to update, as the dx.doi.org url must also be adjusted accordingly -->
        <span class="label">1.</span> <span data-type="surname">Doe</span> <span data-type="given-names">John</span> (<span data-type="year">2010</span>) <span class="title">Article <em>X</em></span>. <span data-type="source">Journal Y</span> <span data-type="volume">1</span>: <span data-type="fpage">40</span>-<span data-type="lpage">45</span>. Available: <span data-type="doi" class="doi"><a href="http://dx.doi.org/10.1234/myJournal.00005">10.1234/myJournal.00005</a></span>.
      </div>
    </div>
  </div>
</body>
</html>

My thoughts:

  • updating this structure is very hard because of redundancy
  • e.g. when we want to update the article title we need to:
    • update h1.title with annotated title
    • derive a plain text version of the title
    • store this plain text version on the title tag, which is in the head
  • the article doi is also hard to change because of redundancy: the plain DOI and the dx.doi.org url have to be updated together
  • the file is hard to read: the presentation part is polluted with semantic tags, while the data part is polluted with style tags

For illustration:

If we have a pure data format we can manipulate docs in an object key-value fashion like this:

doc.set("meta.title", "some <em>annotated title</em>");
doc.set("bib1.doi", "10.1234/myJournal.00001");

If we use hierarchical HTML (as specified) we always have to query the DOM for the right elements, and deal with redundancy.

var titleEl = document.querySelector("*[data-type=meta] *[data-type=title]");
titleEl.innerHTML = "some <em>annotated title</em>";
// set the title and use the plain text content
document.title = titleEl.textContent;
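For comparison, here is a minimal sketch of what could sit behind the key-value calls shown earlier. The `Doc` class, its `id.property` path convention and the tag-stripping regex are assumptions for illustration; the thread does not specify an implementation. The point is that the plain title and the DOI link are derived when needed instead of being stored a second time.

```javascript
// Hypothetical flat store: every node is addressed by id,
// so an "id.property" path is enough to reach any value.
class Doc {
  constructor(nodes) {
    this.nodes = nodes;
  }

  resolve(path) {
    const dot = path.indexOf(".");
    if (dot < 0) throw new Error("Expected id.property, got: " + path);
    const id = path.slice(0, dot);
    if (!Object.prototype.hasOwnProperty.call(this.nodes, id)) {
      throw new Error("Unknown node id: " + id);
    }
    return { node: this.nodes[id], prop: path.slice(dot + 1) };
  }

  set(path, value) {
    const { node, prop } = this.resolve(path);
    node[prop] = value;
  }

  get(path) {
    const { node, prop } = this.resolve(path);
    return node[prop];
  }
}

const doc = new Doc({
  meta: { title: "The Tahi Article Format", doi: "10.1234/myJournal.00001" },
});

doc.set("meta.title", "some <em>annotated title</em>");

// Derived values are computed on output, never stored.
// The regex is a naive tag stripper, good enough for this illustration only.
const plainTitle = doc.get("meta.title").replace(/<[^>]+>/g, "");
const doiUrl = "http://dx.doi.org/" + doc.get("meta.doi");

console.log(plainTitle); // some annotated title
console.log(doiUrl); // http://dx.doi.org/10.1234/myJournal.00001
```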

I would suggest a pragmatic approach:

  • we could go with a pure data format (XML) internally
  • keep this a secret (we don't need to communicate this as an interface)
  • what people will see is only the HTML, when they use the editor they see the HTML change
  • it's always possible to go to any HTML spec we want in the future, if we have the pure data format, but in the other direction we are trapped in the specific serialization we've chosen

@mfenner

mfenner commented May 20, 2015

My two cents:

  • if we go with XML as the source format, I would consider JATS. Not perfect for our scenario, but good enough, and with a large number of tools available.
  • for the HTML representation I worry that we move too much into custom HTML that might be difficult to maintain in the long run. Do we want to write a spec and validators for this HTML? To be more mainstream, I would for example consider using DublinCore meta tags for things like author or title.

@michael
Author

michael commented May 21, 2015

I also thought of JATS, used more strictly than the spec requires. But I see a disadvantage in its hierarchical nature: we cannot use it as a random-access model (see the doc.set API shown above), which would be really handy and expressive.

I could see JATS as an option, though we would need query selectors and some extra markup for data manipulation in many cases. Much better than the HTML option, but we could get more.

I think we should not be afraid of defining a minimal data format tailored for Tahi purposes, so that it's as easy as possible to work with from all Tahi subsystems. Then we write generators for JATS, HTML, ePub etc., which are the formats we communicate to the public. It's REALLY trivial to implement those, it's just a projection of data to markup.

As an example, Pandoc uses exactly that strategy to deal with all its different input and output formats. There's an internal data format, the Pandoc AST, that is optimized to be most useful for Pandoc reader and writer implementations. It's not communicated as an exchange format. We could do the same with a Tahi Source XML (see the very first draft). It would be tailored to the nature of our data, and thus make writing Tahi components much easier, since we don't have to deal with redundancy and hierarchy. We can then consider Tahi Articles as little databases that we can query and update without thinking about the serialization or dealing with redundancies.

  • Set paragraph content: doc.set("paragraph_4.content", "hello <strong>world</strong>")
  • Set header level: doc.set("heading_10.level", 2)
  • Delete paragraph: doc.delete("paragraph_9")
  • Create a figure: doc.create({id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png"})

And with each change we make, we immediately can output the corresponding JATS, HTML, etc.
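A rough sketch of that idea follows, assuming a node store keyed by id whose insertion order doubles as document order. `createDoc`, `renderNode` and the node shapes are hypothetical names chosen for this illustration; a JATS writer would walk the same nodes with a different `renderNode`.

```javascript
// Hypothetical sketch; names and node shapes are illustrative, not a Tahi API.
function renderNode(node) {
  switch (node.type) {
    case "heading":
      return "<h" + node.level + ' id="' + node.id + '">' + node.content +
        "</h" + node.level + ">";
    case "paragraph":
      return '<p id="' + node.id + '">' + node.content + "</p>";
    case "fig":
      return '<figure id="' + node.id + '"><img src="' + node.url +
        '"><figcaption>' + node.caption + "</figcaption></figure>";
    default:
      throw new Error("No renderer for type: " + node.type);
  }
}

function createDoc() {
  const nodes = new Map(); // insertion order doubles as document order here
  return {
    create(node) {
      nodes.set(node.id, { ...node });
    },
    set(path, value) {
      const [id, prop] = path.split(".");
      const node = nodes.get(id);
      if (!node) throw new Error("Unknown node: " + id);
      node[prop] = value;
    },
    delete(id) {
      nodes.delete(id);
    },
    // One possible projection of the data to markup.
    toHTML() {
      return Array.from(nodes.values()).map(renderNode).join("\n");
    },
  };
}

const doc = createDoc();
doc.create({ id: "heading_10", type: "heading", level: 1, content: "A heading" });
doc.create({ id: "paragraph_4", type: "paragraph", content: "" });
doc.create({ id: "paragraph_9", type: "paragraph", content: "to be removed" });

doc.set("paragraph_4.content", "hello <strong>world</strong>");
doc.set("heading_10.level", 2);
doc.delete("paragraph_9");
doc.create({ id: "fig_10", type: "fig", caption: "my <em>caption</em>", url: "fig_14.png" });

console.log(doc.toHTML());
```

The store never sees any markup besides the inline annotations inside content values, so each output format is a separate, replaceable function over the same nodes.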

@mfenner

mfenner commented May 21, 2015

@michael: this makes sense to me. And I am a big fan of the Pandoc AST :). But it means that there is an internal Tahi format different from the HTML output, which is conceptually different from using a single HTML format all the way through.

@jure

jure commented May 26, 2015

Note to self, clicking on the star button in GitHub before submitting your comment will result in total loss of content. So here I go again.

I can see the benefits of two formats, one for source and one for presentation, but if the redundancy in the one-format HTML scenario is limited to these few situations, i.e. the title, citations, and author, that's likely less redundant than using two formats for what can be achieved in one. And that's before considering Tahi's HTML promise, which should not be easily dismissed.

To me that HTML example that you provided looks OK, and if the issue is that every component will have to address these redundancies, perhaps that could be addressed with a single point of contact with the HTML, an HTML writer module, which deals with redundancies in one place.

Additionally, I went digging for thoughts on how to store JSON in data attributes, and it's very straightforward. The HTML spec allows for single quoting of attributes, which means you can do something like this:

<div id="awesome-json" data-awesome='{"game":"on"}'></div> 

And then access it very elegantly like so:

var gameStatus= jQuery("#awesome-json").data('awesome').game;
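One caveat when the attribute is written by string concatenation on the server: the JSON itself may contain an apostrophe or markup, which would end the single-quoted attribute early or confuse the parser. Below is a small sketch of the round trip, assuming hypothetical helper names (`escapeAttr`, `unescapeAttr`, `toDataAttr`). In a browser the HTML parser does the unescaping; it is done by hand here so the example runs without a DOM.

```javascript
// Hypothetical helpers, not part of jQuery or any Tahi module.
function escapeAttr(text) {
  // "&" must be escaped first so later entities are not double-escaped.
  return text.replace(/&/g, "&amp;").replace(/'/g, "&#39;").replace(/</g, "&lt;");
}

function unescapeAttr(text) {
  // Reverse order of escapeAttr: "&amp;" is restored last.
  return text.replace(/&lt;/g, "<").replace(/&#39;/g, "'").replace(/&amp;/g, "&");
}

function toDataAttr(name, value) {
  return "data-" + name + "='" + escapeAttr(JSON.stringify(value)) + "'";
}

const original = { game: "on", note: "it's <em>on</em>" };
const attr = toDataAttr("awesome", original);
console.log(attr);

// Simulate what the HTML parser would hand back as the attribute value.
const raw = attr.slice("data-awesome='".length, -1);
const parsed = JSON.parse(unescapeAttr(raw));
console.log(parsed.note);
```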

In summary, my 2 cents are in favor of a one-format scenario, with source data stored in the HTML using JSON in data attributes. I feel like this could be done very elegantly as well and would be less redundant than two formats. In essence, both solutions can be elegant, but if we also consider, on top of elegance and correctness, that the Tahi HTML promise is an important one, the scale tips in favor of HTML.
