Ruby in HTML

Author: Martin J. Dürst ([email protected]). Last updated May 14th, 1996

Abstract

Ruby are used frequently in Japanese and other ideographic languages to indicate the pronunciation of a character or character combination. They are set smaller than the main text and placed above it (in horizontal text) or to its right (in vertical text). In electronic texts, they can also be used for searching, indexing, and text-to-speech conversion. This document discusses the need for ruby in HTML and possible solutions for including them.

Here you can find a copy of the first version (version 000) of the Internet-Draft on Ruby in HTML. The following corrections were or are being made for the next version (planned for end of September):

"Changes to the DTD" has to be changed, because otherwise you can put RUBY even on IMG, SUP, SUB, and so on, where it does not make sense. Please send me comments if you have specific ideas about which elements ruby should or should not go on.
Corrected reference to JIS X 4051.
Slightly adjusted notation definition and examples.
Corrected some spelling mistakes.

The spelling of the word "Ruby" was "Rubi" in older versions of this document. "Ruby" is the correct English spelling, and should be used in all cases.

Comments are welcome and should be directed to the author (preferred) or to the HTML Working group mailing list (use with care!). This is the place to send mail to if you want to subscribe to that list.

Ruby Explained
Places for a HTML Ruby Proposal
HTML Syntax for Ruby
A Formal Proposal
Hints for Implementers
History of this Proposal

1. Ruby Explained

1.1 Ruby Basics

Ruby are small characters used to annotate a text, placed to the right of vertical text and above horizontal text, to indicate the reading (pronunciation) of ideographic characters.

They are used in Japan in most kinds of publications, such as books and magazines, but also in China, especially in schoolbooks. With the increasingly international use of the WWW, new and very beneficial uses of ruby can also appear.

In texts stored electronically and enriched with structural markup, ruby can be very convenient for applications other than pure browsing. In particular, they will be of immense value for searching, indexing, and text-to-speech conversion.

The name "ruby" is the name of the 5.5 point type size in British terminology; this was the size most used for ruby. We can therefore consider it a rather neutral term, neutral enough to be suitable for HTML.

1.2 The Appearance of Ruby

Ruby are in most cases set at half the size of the main letters, so that two ruby characters fit per main character, each taking up half the width of a main character. However, up to at least five ruby characters per main character occur (an example is "u-ke-ta-ma-wa-..."), and so various solutions are possible, from leaving white space in the main text to having the ruby overlap the adjacent characters of the main text. The latter is possible in Japanese especially because in many cases the characters around an ideograph with ruby are syllabic, so that the assignment of ruby to main characters poses no problems for the reader.

The exact arrangement of ruby in different situations is indeed rather difficult, and typesetting and word processing software is often judged by the quality of its ruby treatment. To allow easily readable, high-quality rendering, a simple association of a number of base characters with a number of ruby characters on one level is not sufficient.

As an example, take the name Kobayashi, which is written with two ideographs, which we will denote here with KO and HAYASHI. For the ruby, four syllabic letters are necessary, namely ko-ba-ya-shi. As long as both base characters appear together, it is reasonable to distribute the ruby evenly over both, appearing as follows:

        kobayashi
        KOKO HAYA
        KOKO SHI
    KOKO HAYA

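However, if the line is split between the base characters KO and HAYASHI, it is rather important that the ruby appear on the same line as the base characters they are associated with, for example "ko" above KO at the end of the first line and "bayashi" above HAYASHI at the start of the second, or similar. The ruby should preferably NOT be separated by the line break from the base characters they belong to.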

1.3 Ruby in Japan

Ruby are particularly frequent in Japanese, because of the way CJK ideographs are used in Japanese. Ideographs can have many different readings (pronunciations) because different readings were taken over from different regions of China and at different times when the characters were adopted in Japan. Also, these characters are used to write indigenous Japanese words, and many readings may be possible because the ideograph might cover many different concepts distinguished in the Japanese language.

The use of ruby in Japanese, which before World War II was mostly in the form of sou-ruby (ruby provided for all ideographic characters of a text), also led to many kinds of word-play that would be impossible without ruby and that can be considered an integral part of Japanese culture, from traditional texts to modern advertising. The main use of ruby today is in magazines of all levels, and of course in educational material. Ruby are also used in educational material in China and Taiwan.

In Japan, the term "furigana" is also used instead of "ruby". "Furigana" is composed of the verb "furu" (to attach, sprinkle,...) and "gana" (either hiragana or katakana, one of the two Japanese syllabaries usually used for ruby).

Although the primary field of use is expected to be Japanese and other ideographic writing systems, ruby should not be limited to any subset of characters.

2. Places for a HTML Ruby Proposal

The frequent use of ruby in Japanese and the importance of ruby in other situations, together with the great potential for educational documents and software on the WWW, are clear indications of the need for ruby in HTML. There are two main places where a ruby proposal/standard could go:

1. An independent document (internet draft -> RFC)
2. An integration into the current HTML internationalization (i18n) draft

I very much hope that (2) can be realized. It avoids fragmentation, and it recognizes that ruby are on a level similar to <SUP>, which is needed for the correct rendering of ordinal numbers and some other text in some languages.

The main reason one could bring forward against proposal (2) is that the i18n draft is already in a rather advanced state, and that anything hampering its advancement to an RFC and further on should be avoided. However, the reasons for the current delay of the i18n draft lie elsewhere, and it should well be possible to integrate ruby without additional delays.

3. HTML Syntax for Ruby

(Are there already other places with SGML syntax for ruby? e.g. TEI?)

3.1 Basic Questions and Requirements

To evaluate the various proposals, it is important to know the various requirements and alternatives. Mainly the following points have to be considered:

What should happen if ruby are not supported by a browser?
There are several possible answers to this:
- Ruby should be shown in some form (e.g. inside parentheses after the characters they belong to), because otherwise important information is lost.
- It is not so important that ruby are shown, because they usually do not affect the meaning of a text.
- If they cannot be displayed properly, it is preferable not to show them. Otherwise, texts with many ruby will be difficult to read.
What is the highest level of rendering quality that should be supported by the syntax?
While it should not be expected or required that the average browser does high-quality ruby rendering, some syntax variants may make it possible to express the additional information needed for high-quality rendering, whereas others may not.
How important is the shortness of ruby markup? What length can be tolerated?
In some texts, especially for educational purposes, most ideographs will have ruby, and therefore the markup should not be too long.

3.2 Ruby as an Attribute

As a result of discussions up to now, this seems to be the preferred proposal for HTML due to its simplicity and versatility. The following is an example:

<SPAN RUBY="kobayashi">KO HAYASHI</SPAN>

The RUBY attribute should go on all in-line textual elements, so that e.g.

<EM RUBY="kobayashi">KO HAYASHI</EM>

is also possible. A browser that does not know about ruby will ignore the RUBY attribute and only render the base text. Information for higher-quality rendering can be specified as follows if really necessary:

<SPAN RUBY="kobayashi">
    <SPAN RUBY="ko">KO</SPAN>
    <SPAN RUBY="bayashi">HAYASHI</SPAN>
</SPAN>

The various RUBY attributes on different levels of nested elements have to be interpreted as alternatives. Having ruby as attributes does not allow the ruby text itself to be marked up, e.g. with <STRONG>. But this is not necessary; it is not used in current typographic practice and would complicate implementations unnecessarily.
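
The reading of nested RUBY attributes as alternatives can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the proposal: the names `RubyAttributeParser` and `alternatives` are hypothetical, the markup is assumed to be well nested with explicit end tags, and the rule "use the outermost annotation when the group stays on one line, the innermost otherwise" is one possible interpretation.

```python
from html.parser import HTMLParser


class RubyAttributeParser(HTMLParser):
    """Collect (depth, ruby, base text) for each element carrying a RUBY attribute.

    Hypothetical helper: assumes well-nested markup with explicit end tags.
    """

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of [ruby value or None, text parts]
        self.annotations = []    # finished (depth, ruby, base_text) tuples

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so RUBY arrives as "ruby".
        self.open_elements.append([dict(attrs).get("ruby"), []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for element in self.open_elements:
            element[1].append(data)

    def handle_endtag(self, tag):
        if not self.open_elements:
            return
        ruby, parts = self.open_elements.pop()
        if ruby is not None:
            # Depth counts the enclosing elements that also carry RUBY.
            depth = sum(1 for r, _ in self.open_elements if r is not None)
            base = " ".join("".join(parts).split())
            self.annotations.append((depth, ruby, base))


def alternatives(annotations, keep_together):
    """Pick one level: outermost if the group fits on a line, else innermost."""
    wanted = 0 if keep_together else max(d for d, _, _ in annotations)
    return [(base, ruby) for d, ruby, base in annotations if d == wanted]


parser = RubyAttributeParser()
parser.feed(
    '<SPAN RUBY="kobayashi">'
    '<SPAN RUBY="ko">KO</SPAN> '
    '<SPAN RUBY="bayashi">HAYASHI</SPAN>'
    '</SPAN>'
)
print(alternatives(parser.annotations, keep_together=True))
print(alternatives(parser.annotations, keep_together=False))
```

With `keep_together=True` the single outer pair is chosen; with `keep_together=False` the two inner pairs are chosen, so that each part of the ruby can stay with its own base character across a line break.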

3.3 Ruby as an Element

Ruby can also be represented as an element. There are various variants, such as:

<RUBYBASE> KO HAYASHI </RUBYBASE> <RUBY> kobayashi </RUBY>
<RUBYBASE> KO HAYASHI <RUBY>kobayashi</RUBY> </RUBYBASE>
<SPAN> KO HAYASHI <RUBY>kobayashi</RUBY> </SPAN>
<RUBYBASE> KO <RUBY>ko</RUBY> HAYASHI <RUBY>bayashi</RUBY> </RUBYBASE>

A browser unaware of the new syntax will render the ruby after the base characters without any distinction. To get acceptable rendering, a convention has to be introduced: the ruby are enclosed in parentheses. The syntax may look like this:

        <SPAN> KO HAYASHI <RUBY>(kobayashi)</RUBY> </SPAN>

This results in an acceptable display on systems that don't know ruby, namely:

        KO HAYASHI (kobayashi)

A browser knowing about ruby would remove "(" and ")" before rendering.
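
The parenthesis convention can be illustrated with a small Python sketch. Both function names are hypothetical, and the "unaware browser" is only approximated here by dropping every tag with a regular expression, which is an assumption made for illustration rather than a description of any real browser.

```python
import re


def render_without_ruby_support(markup):
    """Approximate a ruby-unaware browser: drop all tags, keep the text."""
    text = re.sub(r"<[^>]+>", "", markup)
    return " ".join(text.split())


def strip_fallback_parentheses(ruby_text):
    """What a ruby-aware browser would do before rendering the ruby."""
    text = ruby_text.strip()
    if text.startswith("(") and text.endswith(")"):
        return text[1:-1]
    return text


markup = "<SPAN> KO HAYASHI <RUBY>(kobayashi)</RUBY> </SPAN>"
print(render_without_ruby_support(markup))        # KO HAYASHI (kobayashi)
print(strip_fallback_parentheses("(kobayashi)"))  # kobayashi
```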

To allow the specification of additional information for high-quality rendering, it can be noted that the variant

<RUBYBASE> KO <RUBY>ko</RUBY> HAYASHI <RUBY>bayashi</RUBY> </RUBYBASE>

allows associations on two levels. To allow associations on more than two levels, an attribute, here called ASSOC, could be introduced as follows:

<SPAN> ABCD <RUBY ASSOC="2-5;1-3,3-6">(ZYXWVUST)</RUBY> </SPAN>

In this syntax, ";" is used to separate hierarchical levels. "," is used to separate several entries of correspondences in a single hierarchical level, and "-" separates a splitting point in the base string from a splitting point in the ruby string.

The numbering of the splitting points starts with 0 before the first character, adding one after each character, for our example:

    0 A 1 B 2 C 3 D 4

and

    0 Z 1 Y 2 X 3 W 4 V 5 U 6 S 7 T 8

The numbering of the splitting points occurs after elimination of parentheses and similar characters as discussed above. The sequence of correspondences at one level has to be formed so that all correspondences from higher levels can be inserted so that in the resulting sequence of correspondences, both the splitting points of the base string and the splitting points of the ruby string are strictly increasing. If this is not possible, the whole sequence is illegal. In particular, the top level, where no sequence from a higher level is inserted, is illegal if it is not already formed so that the splitting points of the base string and the splitting points of the ruby string are strictly increasing.
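
A minimal Python sketch of how the ASSOC value from the example could be parsed and checked against the strictly-increasing rule is given below. The function names `parse_assoc`, `is_legal` and `split_at` are hypothetical, and the sketch assumes a syntactically well-formed attribute value and does not check that the splitting points lie inside the strings.

```python
def parse_assoc(assoc):
    """Parse e.g. "2-5;1-3,3-6" into levels of (base_point, ruby_point) pairs."""
    levels = []
    for level_text in assoc.split(";"):        # ";" separates hierarchical levels
        pairs = []
        for entry in level_text.split(","):    # "," separates entries in a level
            base_point, ruby_point = entry.split("-")
            pairs.append((int(base_point), int(ruby_point)))
        levels.append(pairs)
    return levels


def strictly_increasing(pairs):
    """True if both base and ruby splitting points strictly increase."""
    return all(b0 < b1 and r0 < r1
               for (b0, r0), (b1, r1) in zip(pairs, pairs[1:]))


def is_legal(levels):
    """Apply the rule level by level, inserting all higher-level entries."""
    inherited = []
    for pairs in levels:
        merged = sorted(inherited + pairs)
        if not (strictly_increasing(pairs) and strictly_increasing(merged)):
            return False
        inherited = merged
    return True


def split_at(base, ruby, pairs):
    """Cut base and ruby strings at the given splitting points."""
    segments = []
    b_prev = r_prev = 0
    for b, r in list(pairs) + [(len(base), len(ruby))]:
        segments.append((base[b_prev:b], ruby[r_prev:r]))
        b_prev, r_prev = b, r
    return segments


levels = parse_assoc("2-5;1-3,3-6")
print(is_legal(levels))                            # True
print(split_at("ABCD", "ZYXWVUST", levels[0]))     # top level
print(split_at("ABCD", "ZYXWVUST", sorted(levels[0] + levels[1])))
```

For the example, the top level associates AB with ZYXWV and CD with UST; with the second level inserted, A goes with ZYX, B with WV, C with U, and D with ST.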

4. A Formal Proposal

The following text should be added to Chapter 4., Additional entities, attributes and elements, of the internet draft on HTML internationalization, in the next version:

Annotations to indicate the pronunciation of ideographic characters
are extremely common in Japanese, and very useful for some other languages.
They are also indispensable for applications such as searching, indexing,
and text-to-speech conversion.

Using the British name of the respective type size, these annotations
are customarily called "ruby". A new attribute RUBY is introduced on
in-line text elements to contain the respective information.

If RUBY attributes are present on several levels of nested in-line
elements, then these attributes are to be considered as alternatives,
and not in a cumulative way. Such information allows sophisticated
rendering on high-quality browsers.

The length of a group of base characters or the number of ruby
characters per base character are not limited by this specification,
but authors and tools are requested to keep these numbers reasonably low.
Also, there is no restriction of the types of base characters
to which ruby can be attached, or of the types of characters
that can be used as ruby.

Comment: The length of a group of base characters, in the case of Japanese, will have an average of about two, with four or five characters still being common. For the number of ruby per base character, five is a number for which examples are known, but here also the average will be close to two. These numbers will be different if for example Chinese texts are annotated with Pinyin romanization, or other useful combinations of base characters and ruby are used, but then again Latin letters will be about half the width of kana.

5. Hints for Implementers

If you want to implement ruby in your browser, these are some of the possible solutions, listed by increasing typographic quality.

1. Display ruby in-line, after their base characters, in parentheses. In this case, an option to switch off ruby display is almost mandatory, because texts with many ruby will be difficult to read. For other implementations, an option to switch off ruby display may also be a good idea, but it is not as necessary as here.
2. Place ruby above their base characters, with half the height of the base characters. Use fixed spacing. In case the ruby are longer than their corresponding base characters, leave some space blank after the base characters. Always keep a group of base characters and their ruby on the same line.
3. Same as the previous solution, but expand ruby proportionally in case they are shorter than their associated base characters.
4. In case the ruby are longer than their associated base characters, test if previous or following characters of the base text have associated ruby. If this is not the case (particularly if these characters are not ideographic), let the ruby overlap the base characters to avoid blank space.
5. Use nested ruby attributes for highest-quality rendering including line-breaks.
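
The width arithmetic behind the second and third solutions in the list above can be sketched in Python as follows. This is an illustration only: the function name `layout_ruby_group` is hypothetical, widths are measured in units of one base character, ruby are assumed to be exactly half as wide as base characters, and "expand proportionally" is interpreted here as spreading the ruby evenly over the width of the base characters.

```python
def layout_ruby_group(base_count, ruby_count, ruby_scale=0.5):
    """Return widths, in units of one base character, for one ruby group."""
    if base_count <= 0 or ruby_count <= 0:
        raise ValueError("a group needs at least one base and one ruby character")
    base_width = float(base_count)
    ruby_width = ruby_count * ruby_scale
    group_width = max(base_width, ruby_width)
    if ruby_width < base_width:
        # Third solution: spread short ruby evenly over the base characters.
        ruby_pitch = base_width / ruby_count
    else:
        # Second solution: fixed spacing at the ruby character width.
        ruby_pitch = ruby_scale
    return {
        "group_width": group_width,
        # Blank space left after the base characters when ruby are longer.
        "blank_after_base": group_width - base_width,
        "ruby_pitch": ruby_pitch,
    }


# One base character with five ruby characters, as in "u-ke-ta-ma-wa-...".
print(layout_ruby_group(1, 5))
# Two base characters with a single ruby character.
print(layout_ruby_group(2, 1))
```

In the first call the ruby are 2.5 base characters wide, so 1.5 base characters of blank space follow the base character; in the second call the single ruby character is spread over the full width of the two base characters.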

6. History of this Proposal

Ruby were discussed very intensely at the 8th Unicode conference on April 18th/19th 1996 in Hong Kong. The main reason for this was the invited opening talk by Junichiro Kida, who spoke about the importance of ruby in Japanese literature and culture. Discussion continued throughout the conference, at the final question-and-answer session, and after the conference in a more private setting.

The general agreement was that ruby cannot and should not be encoded as special characters, but that they should be handled on a higher level. One particularly widely used higher level is HTML, and as one of the authors of the HTML I18N draft, I felt the responsibility to carry this thread further and write this document.

The recent I18N workshop at the WWW conference in Paris has given me further motivation and justification for including ruby in HTML. Comments from François Yergeau (who had drawn up an internal proposal for ruby using attributes) and from others were particularly helpful.

I am grateful to the following persons for their advice and help:

Junichiro Kida, Literary Critic
Yasuo Kida, Apple Japan
Tatsuo L. Kobayashi, Just Systems
François Yergeau, Alis Technology
Gavin Nicol, ETB, Tokyo
Martin Brian
The organizers of the 8th Unicode conference
The participants of the I18N workshop at the WWW conference in Paris

Comments on all aspects of this proposal are welcome. Please send them to [email protected].

May 14th, 1996, [email protected]
