The “PanCJKV” IVD Collection

Much of the thinking that I did with regard to this unregistered—but hopefully soon-to-be-registered—IVD (Ideographic Variation Database) collection was done while visiting my parents in South Dakota, with one of the highlights of that trip being a scenic drive through Badlands National Park.

First and foremost, please forget, or at least ignore, most everything that was written in the 2016-02-13 and 2016-02-20 articles (which makes one wonder why I am linking to them, but I digress). Far too many things have changed, and what I present in this article represents the IVD collection that I hope will be registered later this year.

In terms of changes, two more regions—SG (Republic of Singapore) and MY (Malaysia), along with the pseudo-region XK (Kāngxī; note that the order of the characters of this region code is intention, in order to be a valid user-defined region code)—were added, increasing the total number of IVSes (Ideographic Variation Sequences) per character to eleven, and assignment of VSes (Variation Selectors) to regions was reversed. For Unicode Version 8.0, this means a total of 884,268 IVSes. The latest PanCJKV IVD Collection project on GitHub reflects these changes, and is also expected to serve as the URL of the site that describes the collection as indicated in the IVD_Collections.txt file.

One of the more interesting changes, in my opinion, is the addition of the pseudo-region XK (Kāngxī), which corresponds to forms that follow the style of the legendary Kāngxī dictionary (康熙字典 kāngxī zìdiǎn). I feel that this is important, because there is no region that completely adheres to this style, meaning that there is also no way to specify this style via language tagging. As the architect and developer of the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK typeface families, I have received a non-zero number of requests to follow the Kāngxī dictionary style. While we have no such plans for these two joined-at-the-hip typeface families, there is clear value in supporting such fonts via this IVD collection.

The table below lists the eleven VSes that would be completely consumed by this IVD collection, their code points, and to which region (or pseudo-region) they correspond (using a two-letter region code that is reflected in the sequence identifier):

Variation Selector	Code Point	Region Code
VS246	U+E01E5	VN (Việt Nam)
VS247	U+E01E6	KP (DPRK)
VS248	U+E01E7	KR (ROK)
VS249	U+E01E8	JP (Japan)
VS250	U+E01E9	MY (Malaysia)
VS251	U+E01EA	MO (Macao SAR)
VS252	U+E01EB	HK (Hong Kong SAR)
VS253	U+E01EC	TW (ROC)
VS254	U+E01ED	SG (Republic of Singapore)
VS255	U+E01EE	CN (PRC)
VS256	U+E01EF	XK (Kāngxī)—pseudo-region

Sequence identifiers use strings of the form uniHHHH (BMP) or uHHHHH (non-BMP) that correspond to the CJK Unified Ideograph (aka Base Character), followed by an underscore (U+005F), and finally a two-letter region code as shown in the table above. For example, the sequence identifier for the form of 曜 (U+66DC) that is used in Japan is uni66DC_JP.

The image below displays all eleven IVSes that correspond to the four CJK Unified Ideographs 一 (U+4E00), 字 (U+5B57), 骨 (U+9AA8), and 曜 (U+66DC) rendered using the four-region example font implementation, SourceHanSansR04-Regular.otf, with the default region (CN) in blue, supported regions (KR, JP, and TW) in black, and unsupported regions in red (whose glyphs necessarily default to the default region):

This IVD collection will be unique from other registered IVD collections in that representative glyphs are not supplied, because they are not necessary. Instead, each IVS corresponds to what is best described as the intended or appropriate representation for a given region, and it is up to each implementation to decide which regions are supported, and which glyphs to supply.

The scope and intent of this IVD collection is best stated as follows: The IVSes for each CJK Unified Ideograph are expected to be displayed according to the conventions and limitations of a particular implementation, in terms of which glyphs are supplied (or default) for a given region and which particular regions are supported, and that there is no guarantee that characters will display according to the Unicode code charts nor according to regional conventions.

UPDATE: Based on some of the comments below, I have changed the region code KX to XK so that it is a valid user-defined one, and have added to the open source project an alternate UVS definition file, SourceHanSansR11_CN_sequences.txt, that aliases the seven unsupported regions to the four supported ones, specifically MY/SG to CN, HK/MO/VN to TW, and KP/XK to KR. The overly colorful image below illustrates the effect of this aliasing:

UPDATE #2: To illustrate how an implementation would use the IVD_Sequences.txt file, the image below shows the corresponding lines for the four example characters, along with the result of running the mkpancjkvivs.pl script with appropriate command-line options to create the SourceHanSansR11_CN_sequences.txt UVS definition file (red indicates the CID for the JP glyph, green indicates the CID for the TW glyph, black indicates the CID for the CN glyph, and blue indicates the CID for the KR glyph, though it is actually a variant form of the JP glyph):

UPDATE #3: Both example fonts are now in the repository, and instead of using PanCJKV as the identifier, R04 (four regions) and R11 (eleven regions) are now used, and all of the source files have been updated accordingly.

UPDATE #4: I finished writing a document to be discussed during UTC #147 in May, and it is now posted on the UTC’s document register as L2/16-063.

🐡

16 Responses to The “PanCJKV” IVD Collection—Unregistered

Charlie says:

February 26, 2016 at 11:05 AM

You wrote of “unsupported regions … whose glyphs necessarily default to the default region.”

Is this a technical limitation? Or more specifically: Why don’t the HK and MO glyphs default to TW, and the KP glyphs to KR?
- Dr. Ken Lunde says:
  
  February 26, 2016 at 12:15 PM
  
  That is purely an implementation decision. For the example font, I used the following command line to build the UVS definition file:
  
  % perl Scripts/mkpancjkvivs.pl -cn Resources/utf32-cn.map -tw Resources/utf32-tw.map -jp Resources/utf32-jp.map -kr Resources/utf32-kr.map < IVD_Sequences.txt > Sources/SourceHanSans_CN_sequences.txt
  
  If I had wanted to alias HK, MO, and KP as you suggested, I would use the following command line instead:
  
  % perl Scripts/mkpancjkvivs.pl -cn Resources/utf32-cn.map -tw Resources/utf32-tw.map -hk Resources/utf32-tw.map -mo Resources/utf32-tw.map -jp Resources/utf32-jp.map -kr Resources/utf32-kr.map -kp Resources/utf32-kr.map < IVD_Sequences.txt > Sources/SourceHanSans_CN_sequences.txt
- Dr. Ken Lunde says:
  
  February 26, 2016 at 3:04 PM
  
  I am thinking to add a second example font implementation that aliases the (seven) unsupported regions to the (four) supported ones. What do you think of the following aliases?
  
  MY, SG → CN
  HK, MO, VN → TW
  KP, KX → KR
  - Charlie says:
    
    February 26, 2016 at 3:21 PM
    
    I think that’s a good idea, though I’m not sure about MY and VN for which I’ve seen fonts that were almost indistinguishable from CN/SG fonts. Just fonts, not standards.
    - Dr. Ken Lunde says:
      
      February 26, 2016 at 3:42 PM
      
      You’re right. MY should alias to CN, not TW (I edited my comment above). VN follows TW conventions more closely, hence the alias.
Charlie says:

February 26, 2016 at 12:49 PM

This a just an idea and maybe not a good one: All your “regions” are named after countries. If you wanted to do the same with the Kāngxī “pseudo-region” you could call it Qīng Empire.
- Dr. Ken Lunde says:
  
  February 26, 2016 at 2:29 PM
  
  Unless I am overwhelmed with suggestions from others to change “KX” to something else, I prefer to keep it as-is. One of its benefits is that it does not conflict with an existing two-letter country code.
  - Charlie says:
    
    February 26, 2016 at 2:58 PM
    
    I see. Note, however, that KX is not a “user-assigned code element” of ISO 3166-1 alpha-2 (something similar to Unicode’s PUA, consisting of the codes QM through QZ; XA through XZ; and ZZ).
    - Dr. Ken Lunde says:
      
      February 26, 2016 at 3:02 PM
      
      The other ten region codes happen to be valid region codes, and there is no requirement that they be valid according to that ISO standard. All that is necessary is some meaningful identifier. Therefore, in the context of the IVD, KX is a perfectly acceptable identifier for the intended purpose.
      - Charlie says:
        
        February 26, 2016 at 3:10 PM
        
        I hope you are right. According to Wikipedia’s ISO 3166-1 alpha-2 decoding table KX is “free for assignment by the ISO 3166/MA only.”
    - Dr. Ken Lunde says:
      
      February 26, 2016 at 3:48 PM
      
      I could simply transpose the letters to arrive at XK, which is a user-assigned code element.
      - Charlie says:
        
        February 26, 2016 at 4:00 PM
        
        Even though this is not a requirement for IVD registration it would be more in line with the rest of the proposal.
Charlie says:

February 26, 2016 at 6:12 PM

To my mind your illustrative graphic has become confusingly colourful. Isn’t it enough to indicate default regions by underlining them in the top row (the one with the country codes) and display all regions that use the same glyphs with exactly the same colour?

So far I’ve been able to reproduce all of your examples on my computer, but now I need an updated font.

P.S. (off topic): When will we have a Hong Kong version of Source Han Sans?
- Dr. Ken Lunde says:
  
  February 26, 2016 at 7:11 PM
  
  I was trying to imitate the pride flag. The intent was to use a darker version of the color for the supported regions, and lighter versions for the ones that are aliased to them.
  
  The HK version of Source Han Sans is dependent on the HKSCS revision. My hope is that this will be done sometime this year.
Charlie says:

February 28, 2016 at 8:00 AM

How does your IVD collection relate to Unicode tag characters? Will you comment on this in your proposal?
- Dr. Ken Lunde says:
  
  February 28, 2016 at 6:01 PM
  
  No relationship whatsoever. And, given the current status of the tag characters, mentioning them in the context of this IVD collection makes little sense. A variation sequence is a completely different beast.

CJK Type Blog

CJK Fonts, Character Sets & Encodings.

The “PanCJKV” IVD Collection—Unregistered

16 Responses to The “PanCJKV” IVD Collection—Unregistered

By Dr. Ken Lunde

Comments (16)

Created