The “PanCJKV” IVD Collection—Unregistered

Much of the thinking that I did with regard to this unregistered—but hopefully soon-to-be-registered—IVD (Ideographic Variation Database) collection was done while visiting my parents in South Dakota, with one of the highlights of that trip being a scenic drive through Badlands National Park.

First and foremost, please forget, or at least ignore, most everything that was written in the 2016-02-13 and 2016-02-20 articles (which makes one wonder why I am linking to them, but I digress). Far too many things have changed, and what I present in this article represents the IVD collection that I hope will be registered later this year.

In terms of changes, two more regions—SG (Republic of Singapore) and MY (Malaysia), along with the pseudo-region XK (Kāngxī; note that the order of the characters of this region code is intention, in order to be a valid user-defined region code)—were added, increasing the total number of IVSes (Ideographic Variation Sequences) per character to eleven, and assignment of VSes (Variation Selectors) to regions was reversed. For Unicode Version 8.0, this means a total of 884,268 IVSes. The latest PanCJKV IVD Collection project on GitHub reflects these changes, and is also expected to serve as the URL of the site that describes the collection as indicated in the IVD_Collections.txt file.

One of the more interesting changes, in my opinion, is the addition of the pseudo-region XK (Kāngxī), which corresponds to forms that follow the style of the legendary Kāngxī dictionary (康熙字典 kāngxī zìdiǎn). I feel that this is important, because there is no region that completely adheres to this style, meaning that there is also no way to specify this style via language tagging. As the architect and developer of the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK typeface families, I have received a non-zero number of requests to follow the Kāngxī dictionary style. While we have no such plans for these two joined-at-the-hip typeface families, there is clear value in supporting such fonts via this IVD collection.

The table below lists the eleven VSes that would be completely consumed by this IVD collection, their code points, and to which region (or pseudo-region) they correspond (using a two-letter region code that is reflected in the sequence identifier):

Variation Selector Code Point Region Code
VS246 U+E01E5 VN (Việt Nam)
VS247 U+E01E6 KP (DPRK)
VS248 U+E01E7 KR (ROK)
VS249 U+E01E8 JP (Japan)
VS250 U+E01E9 MY (Malaysia)
VS251 U+E01EA MO (Macao SAR)
VS252 U+E01EB HK (Hong Kong SAR)
VS253 U+E01EC TW (ROC)
VS254 U+E01ED SG (Republic of Singapore)
VS255 U+E01EE CN (PRC)
VS256 U+E01EF XK (Kāngxī)—pseudo-region

Sequence identifiers use strings of the form uniHHHH (BMP) or uHHHHH (non-BMP) that correspond to the CJK Unified Ideograph (aka Base Character), followed by an underscore (U+005F), and finally a two-letter region code as shown in the table above. For example, the sequence identifier for the form of (U+66DC) that is used in Japan is uni66DC_JP.

The image below displays all eleven IVSes that correspond to the four CJK Unified Ideographs (U+4E00), (U+5B57), (U+9AA8), and (U+66DC) rendered using the four-region example font implementation, SourceHanSansR04-Regular.otf, with the default region (CN) in blue, supported regions (KR, JP, and TW) in black, and unsupported regions in red (whose glyphs necessarily default to the default region):

This IVD collection will be unique from other registered IVD collections in that representative glyphs are not supplied, because they are not necessary. Instead, each IVS corresponds to what is best described as the intended or appropriate representation for a given region, and it is up to each implementation to decide which regions are supported, and which glyphs to supply.

The scope and intent of this IVD collection is best stated as follows: The IVSes for each CJK Unified Ideograph are expected to be displayed according to the conventions and limitations of a particular implementation, in terms of which glyphs are supplied (or default) for a given region and which particular regions are supported, and that there is no guarantee that characters will display according to the Unicode code charts nor according to regional conventions.

UPDATE: Based on some of the comments below, I have changed the region code KX to XK so that it is a valid user-defined one, and have added to the open source project an alternate UVS definition file, SourceHanSansR11_CN_sequences.txt, that aliases the seven unsupported regions to the four supported ones, specifically MY/SG to CN, HK/MO/VN to TW, and KP/XK to KR. The overly colorful image below illustrates the effect of this aliasing:

UPDATE #2: To illustrate how an implementation would use the IVD_Sequences.txt file, the image below shows the corresponding lines for the four example characters, along with the result of running the script with appropriate command-line options to create the SourceHanSansR11_CN_sequences.txt UVS definition file (red indicates the CID for the JP glyph, green indicates the CID for the TW glyph, black indicates the CID for the CN glyph, and blue indicates the CID for the KR glyph, though it is actually a variant form of the JP glyph):

UPDATE #3: Both example fonts are now in the repository, and instead of using PanCJKV as the identifier, R04 (four regions) and R11 (eleven regions) are now used, and all of the source files have been updated accordingly.

UPDATE #4: I finished writing a document to be discussed during UTC #147 in May, and it is now posted on the UTC’s document register as L2/16-063.


16 Responses to The “PanCJKV” IVD Collection—Unregistered

  1. Charlie says:

    You wrote of “unsupported regions … whose glyphs necessarily default to the default region.”

    Is this a technical limitation? Or more specifically: Why don’t the HK and MO glyphs default to TW, and the KP glyphs to KR?

    • That is purely an implementation decision. For the example font, I used the following command line to build the UVS definition file:

      % perl Scripts/ -cn Resources/ -tw Resources/ -jp Resources/ -kr Resources/ < IVD_Sequences.txt > Sources/SourceHanSans_CN_sequences.txt

      If I had wanted to alias HK, MO, and KP as you suggested, I would use the following command line instead:

      % perl Scripts/ -cn Resources/ -tw Resources/ -hk Resources/ -mo Resources/ -jp Resources/ -kr Resources/ -kp Resources/ < IVD_Sequences.txt > Sources/SourceHanSans_CN_sequences.txt

    • I am thinking to add a second example font implementation that aliases the (seven) unsupported regions to the (four) supported ones. What do you think of the following aliases?

      MY, SG → CN
      HK, MO, VN → TW
      KP, KX → KR

      • Charlie says:

        I think that’s a good idea, though I’m not sure about MY and VN for which I’ve seen fonts that were almost indistinguishable from CN/SG fonts. Just fonts, not standards.

  2. Charlie says:

    This a just an idea and maybe not a good one: All your “regions” are named after countries. If you wanted to do the same with the Kāngxī “pseudo-region” you could call it Qīng Empire.

    • Unless I am overwhelmed with suggestions from others to change “KX” to something else, I prefer to keep it as-is. One of its benefits is that it does not conflict with an existing two-letter country code.

  3. Charlie says:

    To my mind your illustrative graphic has become confusingly colourful. Isn’t it enough to indicate default regions by underlining them in the top row (the one with the country codes) and display all regions that use the same glyphs with exactly the same colour?

    So far I’ve been able to reproduce all of your examples on my computer, but now I need an updated font.

    P.S. (off topic): When will we have a Hong Kong version of Source Han Sans?

    • I was trying to imitate the pride flag. The intent was to use a darker version of the color for the supported regions, and lighter versions for the ones that are aliased to them.

      The HK version of Source Han Sans is dependent on the HKSCS revision. My hope is that this will be done sometime this year.

  4. Charlie says:

    How does your IVD collection relate to Unicode tag characters? Will you comment on this in your proposal?

    • No relationship whatsoever. And, given the current status of the tag characters, mentioning them in the context of this IVD collection makes little sense. A variation sequence is a completely different beast.