Continuing where I left off with the first article about this subject, I’d like to point out some of the implementation details and their ramifications in this article.
Once again, I’d like to clearly state that the PanCJKV IVD Collection is experimental and completely unregistered, meaning that it would be unwise for anyone—besides yours truly, of course—to build and deploy an implementation based on it. The implementation that I built functions correctly because modern environments that support IVSes (Ideographic Variation Sequences) do so in a generic way whereby any valid, though not necessarily registered, sequence is recognized.
One important ramification that implementations need to consider is how unsupported regions are handled. For example, Source Han Sans (and Google’s Noto Sans CJK) currently support four of the eight regions, specifically CN, TW, JP, and KR, in terms of including glyphs that are intended for those regions. In terms of CJK Unified Ideograph coverage, HK is also supported, though it is not yet claimed due to inappropriate glyphs for many of the characters. HK is expected to be a fifth region in the next major version.
So, what happens with IVSes that correspond to an unsupported region? The simple answer is that the VS (Variation Selector) is ignored—but not discarded—and the BC (Base Character), which is a CJK Unified Ideograph, is rendered according to the Format 12 (UTF-32) ‘cmap‘ subtable of the selected font. In the case of the example implementation in the open source PanCJKV IVD Collection project, the default region of the ‘cmap’ table is CN, meaning that IVSes for unsupported regions will display according to CN conventions. However, if the same IVSes are rendered using a different Pan-CJK font that supports one or more of the unsupported (in the example implementation) regions, they will display as intended by that font. The image at the top of this article uses red to indicate IVSes that are unsupported in the example implementation, and are thus rendered according to the default region of the ‘cmap’ table whose glyph corresponds to the first one on each line.
My current plan is to submit a document for UTC #147 (May 9–13, 2016) that proposes that the UTC (Unicode Technical Committee) itself submit the “PanCJKV” IVD collection for registration according to what I have described in these two articles. There are two reasons for this plan:
- Every CJK Unified Ideograph would require eight registered IVSes, meaning that as new CJK Unified Ideographs are added to Unicode, additional IVSes would need to be registered for them. This thus becomes a process that would need to be executed whenever new CJK Unified Ideographs are added in a new version of Unicode, such as what is likely for Unicode Version 10.0 in June of 2017.
- The third field of the IVD_Collections.txt file of the IVD requires a URL of a site describing the collection, and in the case of this IVD collection, Unicode’s Unihan Database Lookup seems to be the most appropriate site for this purpose. The UTC no doubt would want to be involved if an IVD collection were to point to anything on unicode.org
In closing, perhaps the best way to describe the purpose and intent of this IVD collection is as follows: CJK Unified Ideographs are expected to be displayed according to the conventions and limitations of a particular implementation, in terms of which glyphs are supplied for the supported regions and which regions are supported, and that there is no guarantee that characters will display according to the Unicode code charts.
Once again, any and all feedback is welcome and encouraged.
I think you should explain why you have chosen those eight regions but not Singapore (and Malaysia), for which the urgency of standardizing both simplified and traditional characters has been emphasised.
This is excellent feedback, which will be considered. Another region that I have been considering, which is more of a pseudo-region or a style, would be tagged KX, and would correspond to a region-neutral Kangxi style.
With regard to supporting SG and MY via separate registered IVSes, a case would need to be made that demonstrates that the representative glyphs for those regions do not match those of one of the other eight regions. For example, my understanding is that the representative glyphs for CN are suitable for SG use.
If the representative glyphs for those regions do not match those used for one of the existing eight regions for which source references exist, the standardization authorities of those regions need to submit a horizontal extension that provides source references for their region, along with representative glyphs and a request to add new IRG source fields for them, such as kIRG_SGSource for SG, and kIRG_MYSource for MY.
About the possibility of region-agnostic IVSes for the Kangxi pseudo region, that could potentially be handled via an existing region, such as one whose representative glyphs most closely match those used in the Kangxi dictionary. Some people claim that KR is such a region.