IDS + OpenType: Pseudo-encoding Unencoded Glyphs


For those who are not aware, Unicode includes twelve IDCs (Ideographic Description Characters), from U+2FF0 through U+2FFB, that are used in IDSes (Ideographic Description Sequences). An IDS is intended to visually describe the structure of an ideograph by enumerating its components and their arrangement in a hierarchical fashion. Any Unicode character can serve as an IDS component, and the IDCs describe their arrangement. The IRG uses IDSes as a way to detect potentially duplicate characters in new submissions. All existing CJK Unified Ideographs have an IDS, and new submissions require an IDS.
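Because an IDS is written in prefix notation, its structure can be checked by counting operands. Below is a minimal Python sketch of such a check, assuming only the arities of the twelve IDCs mentioned above (U+2FF2 and U+2FF3 take three operands, and the rest take two). The function name `is_well_formed_ids` is my own, and the check ignores any other constraints that Unicode places on IDSes:

```python
# Arity of each Ideographic Description Character in U+2FF0..U+2FFB:
# U+2FF2 and U+2FF3 take three operands, the other ten take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def is_well_formed_ids(ids: str) -> bool:
    """Return True if `ids` is exactly one complete prefix-notation IDS."""
    pending = 1  # number of operands still expected
    for ch in ids:
        if pending == 0:
            return False  # extra characters after a complete sequence
        # Each character fills one slot; an IDC opens `arity` new slots.
        pending += IDC_ARITY.get(ch, 0) - 1
    return pending == 0


BIANG_IDS = "⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心"
print(is_well_formed_ids(BIANG_IDS))       # True
print(is_well_formed_ids(BIANG_IDS[:-1]))  # False (last operand missing)
```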

This article describes a technique that uses IDSes combined with OpenType functionality to pseudo-encode glyphs that are unencoded or not yet encoded. If memory serves, it was Taichi KAWABATA (川幡太一) who originally suggested this technique.

For people and organizations who need to use glyphs that are not encoded, or that are not yet encoded because they are currently in the pipeline, IDSes can serve as an extraordinarily convenient stopgap measure for pseudo-encoding such glyphs in order to make them immediately usable. Of course, an IDS in and of itself does not carry enough information to dynamically compose the glyph to which it corresponds, but a pre-composed glyph can be used as the substitute for an IDS. Furthermore, because IDSes are plain text, when the glyphs that they represent become encoded in a future version of Unicode, their plain-text sequences can be easily converted into the genuine code points for the corresponding characters.
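To illustrate how simple that later conversion can be, here is a small Python sketch that replaces IDSes in plain text with single characters. The helper `replace_ids` is hypothetical, and U+E000 is only a Private Use Area placeholder standing in for whatever code point a character is eventually assigned:

```python
BIANG_IDS = "⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心"

# Hypothetical mapping from IDS to code point. U+E000 is a Private Use
# placeholder, NOT the real code point of any encoded character.
IDS_TO_ENCODED = {BIANG_IDS: "\uE000"}


def replace_ids(text: str, mapping: dict) -> str:
    """Replace every IDS found in `text` with its mapped character."""
    # Longest sequences first, so that an IDS that happens to be
    # contained in a longer IDS is not replaced prematurely.
    for ids in sorted(mapping, key=len, reverse=True):
        text = text.replace(ids, mapping[ids])
    return text


before = BIANG_IDS * 2 + "麺"
after = replace_ids(before, IDS_TO_ENCODED)
print(len(before), len(after))  # 39 3
```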

The technique is to use an OpenType feature that converts a sequence of glyphs into a single glyph. The sequence, of course, is an IDS, and the single glyph is a pre-composed form of the intended glyph that corresponds to the IDS.

There are two caveats to the approach of using IDSes to pseudo-encode glyphs:

  • Glyphs for all of the necessary IDCs and components must be contained in the same font resource as the glyphs that the corresponding IDSes represent.
  • The application must support the ‘ccmp’ (Glyph Composition/Decomposition) GSUB feature. An alternative is to use the ‘liga’ (Standard Ligatures) GSUB feature, which is arguably more broadly supported, but can be toggled on and off.

To exemplify this technique, we will pseudo-encode the Japanese form of the not-yet-encoded biáng character by building a small Adobe-Identity-0 ROS CID-keyed OpenType/CFF font. This character’s IDS is ⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心, and requires five different IDCs (⿰⿱⿲⿳⿺) and nine different components (刂幺心月穴言辶長馬). The tx-generated glyph synopsis for the font shows the five IDCs at CIDs 2 through 6, the nine components at CIDs 7 through 15, and the pre-composed glyph at CID+16.

Below are the mappings used in the font’s CMap resource:

Unicode       CID
U+0020 ( )    1
U+2FF0 (⿰)   2
U+2FF1 (⿱)   3
U+2FF2 (⿲)   4
U+2FF3 (⿳)   5
U+2FFA (⿺)   6
U+5202 (刂)   7
U+5E7A (幺)   8
U+5FC3 (心)   9
U+6708 (月)   10
U+7A74 (穴)   11
U+8A00 (言)   12
U+8FB6 (辶)   13
U+9577 (長)   14
U+99AC (馬)   15

Of course, CID+16 is not shown in the above table because it is unencoded. ☺
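The CIDs in the table follow code-point order, with the IDCs first and the components after them. As a rough Python sketch of how such a table could be derived from the IDS alone (the variable names are my own, and the ordering is simply what the table shows, not a requirement of the technique):

```python
BIANG_IDS = "⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心"


def is_idc(ch: str) -> bool:
    """True for the twelve IDCs, U+2FF0 through U+2FFB."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


# Unique IDCs and components, each sorted by code point.
idcs = sorted({ch for ch in BIANG_IDS if is_idc(ch)})
components = sorted({ch for ch in BIANG_IDS if not is_idc(ch)})

# CID 1 is the space; IDCs and components follow from CID 2 onward.
cid_for = {" ": 1}
for cid, ch in enumerate(idcs + components, start=2):
    cid_for[ch] = cid

print("Unicode CID")
for ch, cid in cid_for.items():
    print(f"U+{ord(ch):04X} ({ch}) {cid}")
```

The next free CID after the last component, 16, is the one that the pre-composed glyph occupies.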

Below is the ‘ccmp’ GSUB feature definition that is used in the features file:

feature ccmp {
  # Replace the 19-glyph IDS, expressed as CIDs, with the pre-composed glyph at CID 16
  substitute \6 \13 \5 \11 \2 \10 \2 \4 \3 \8 \14 \3 \12 \15 \3 \8 \14 \7 \9 by \16;
} ccmp;
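The rule above is just the IDS with each character replaced by its CID, so it can be generated mechanically rather than typed by hand. Below is a minimal Python sketch that does so, assuming the CID assignments from the mappings table. The function `feature_for_ids` is a hypothetical helper of my own, not part of any tool:

```python
BIANG_IDS = "⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心"

# CID assignments taken from the mappings table in this article.
CID_FOR = {
    "⿰": 2, "⿱": 3, "⿲": 4, "⿳": 5, "⿺": 6,
    "刂": 7, "幺": 8, "心": 9, "月": 10, "穴": 11,
    "言": 12, "辶": 13, "長": 14, "馬": 15,
}


def feature_for_ids(ids: str, target_cid: int, tag: str = "ccmp") -> str:
    """Build a feature block that maps the IDS glyphs to one CID."""
    # A KeyError here means the font lacks a glyph that the IDS needs.
    sequence = " ".join(f"\\{CID_FOR[ch]}" for ch in ids)
    return (
        f"feature {tag} {{\n"
        f"  substitute {sequence} by \\{target_cid};\n"
        f"}} {tag};"
    )


print(feature_for_ids(BIANG_IDS, 16))
```

Passing a different `tag` value, such as `"liga"`, produces the same rule under another feature tag.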

If you wish to invoke this glyph via the ‘liga’ GSUB feature, either in lieu of or in addition to the ‘ccmp’ feature, simply clone the above and change the two instances of the feature tag accordingly.

Those who would like to use this font can download and install the Heavy version. The source files are also available for download for those who would like to look at them.

Enjoy!
