Standardized Variants—Part 1

One problem that has long plagued CJK Compatibility Ideographs is that they are adversely affected by normalization. Regardless of which of the four normalization forms is applied (NFC, NFD, NFKC, or NFKD), they are converted to their canonical equivalents, which are CJK Unified Ideographs. This is a problem, particularly for Japan, because 75 kanji in JIS X 0213:2004 map to CJK Compatibility Ideograph code points. Furthermore, 57 of these 75 kanji correspond to Jinmei-yō Kanji (人名用漢字), meaning that they are used for personal names. The bottom-line problem is that any application of normalization, by any process, permanently removes the distinction between a CJK Compatibility Ideograph and its canonical equivalent. Not all processes are under one’s direct control, so it is impossible to guarantee that normalization will not be applied. My opinion is that it is prudent to assume that normalization will be applied, and that preemption is the best solution.
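To make the effect concrete, here is a minimal sketch using Python's standard `unicodedata` module. It takes U+F91D, a CJK Compatibility Ideograph whose canonical equivalent is U+6B04, and applies each of the four normalization forms. The helper name `normalize_all` is my own and purely illustrative.

```python
import unicodedata

# U+F91D is a CJK Compatibility Ideograph; its canonical equivalent
# is the CJK Unified Ideograph U+6B04.
compat = "\uF91D"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_all(text):
    """Return the text as it looks under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

results = normalize_all(compat)
for form, value in results.items():
    # Show the code point that remains after normalization.
    print(form, "U+%04X" % ord(value))
```

All four forms report U+6B04, so the original code point cannot be recovered from the normalized text.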

The background is that WG2 N4246 proposed 1,002 standardized variants, implemented as variation sequences, that correspond to the 1,002 CJK Compatibility Ideographs. The first character in each variation sequence is the CJK Unified Ideograph that is the canonical equivalent of the CJK Compatibility Ideograph, and the second character is one of the 16 Variation Selectors in the BMP (as opposed to the 240 in the Variation Selectors Supplement in Plane 14, which are reserved for Ideographic Variation Sequences).
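The two-character structure can be sketched in Python as follows. The block ranges are those of the Variation Selectors block (U+FE00 through U+FE0F) and the Variation Selectors Supplement (U+E0100 through U+E01EF). The function names are hypothetical, and pairing U+6B04 with U+FE00 in the example is an assumption for illustration only; the selector actually assigned to each ideograph is whatever the proposal specifies.

```python
def selector_kind(ch):
    """Classify a character as a BMP or Plane 14 variation selector, else None."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return "BMP"          # the 16 Variation Selectors in the BMP
    if 0xE0100 <= cp <= 0xE01EF:
        return "supplement"   # the 240 selectors in Plane 14
    return None

def split_variation_sequence(seq):
    """Split a two-character variation sequence into (base, selector kind)."""
    if len(seq) != 2 or selector_kind(seq[1]) is None:
        raise ValueError("not a base character followed by a variation selector")
    return seq[0], selector_kind(seq[1])

# Assumed pairing, for illustration: unified ideograph U+6B04 plus U+FE00.
example = "\u6B04\uFE00"
base, kind = split_variation_sequence(example)
print("U+%04X" % ord(base), kind)
```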

The benefit of these standardized variants, of course, is that they are immune to the effects of normalization. In all fairness, I should note that an expert from Japan, Mr. Masahiro Sekiguchi, objected to these standardized variants, and the details of his objections, along with alternative solutions, are recorded in WG2 N4247. The reply from the UTC (Unicode Technical Committee) is recorded in WG2 N4309.
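As a rough check of that immunity, the sketch below compares a CJK Compatibility Ideograph with a variation sequence under all four normalization forms, again using Python's `unicodedata` module. The sequence U+6B04 followed by U+FE00 is an assumed example, and `survives` is a name I made up for this illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives(text):
    """True if the text is unchanged by every one of the four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)

compat = "\uF91D"            # CJK Compatibility Ideograph
sequence = "\u6B04\uFE00"    # unified ideograph + variation selector (assumed pairing)

print("compatibility ideograph survives:", survives(compat))
print("variation sequence survives:", survives(sequence))
```

The variation sequence comes through intact because neither the unified ideograph nor the variation selector has a decomposition, whereas the compatibility ideograph is replaced by its canonical equivalent.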

The current status of these 1,002 standardized variants is that they passed WG2 (short for ISO/IEC JTC1/SC2/WG2) and are likely to be incorporated into a future version of ISO/IEC 10646 (and thus Unicode). Before that is a done deal, however, they must be included in a ballot. The Unicode Pipeline Table, which details proposed characters and their current standardization status, already lists them.

Stay tuned for Parts 2 and 3 of this article. Part 2 will compare these standardized variants with IVSes (Ideographic Variation Sequences), and Part 3 will detail font implementations.

2 Responses to Standardized Variants—Part 1

  1. 欄 廊 朗 虜 類 猪 神 祥 福 諸 都 侮 僧 勉 勤 卑 嘆 器 墨 層 悔 憎 懲 敏 暑 梅 海 渚 漢 煮 琢 碑 社 祉 祈 祐 祖 祝 禍 禎 穀 突 節 練 繁 署 者 臭 著 視 謁 謹 賓 贈 逸 難 響
    vs.
    欄 廊 朗 虜 類 猪 神 祥 福 諸 都 侮 僧 勉 勤 卑 嘆 器 墨 層 悔 憎 懲 敏 暑 梅 海 渚 漢 煮 琢 碑 社 祉 祈 祐 祖 祝 禍 禎 穀 突 節 練 繁 署 者 臭 著 視 謁 謹 賓 贈 逸 難 響

    • Interestingly, the line that consisted of the 57 CJK Compatibility Ideographs that correspond to Jinmei-yō Kanji looked fine prior to approving your comment, but was normalized upon approving/publishing the comment, thus causing both lines to become the same. If I enter the characters as Numeric Character References, in the form &#xXXXX; (where XXXX are the hexadecimal digits that correspond to the Unicode Scalar Value, such as F91D for U+F91D), they are somehow not normalized. See:

      欄 廊 朗 虜 類 猪 神 祥 福 諸 都 侮 僧 勉 勤 卑 嘆 器 墨 層 悔 憎 懲 敏 暑 梅 海 渚 漢 煮 琢 碑 社 祉 祈 祐 祖 祝 禍 禎 穀 突 節 練 繁 署 者 臭 著 視 謁 謹 賓 贈 逸 難 響
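For anyone who wants to try the same workaround, here is a small Python sketch that converts characters to Numeric Character References of the form described above. The function name `to_ncr` is hypothetical. Whether a given publishing system leaves such references untouched is an assumption that should be tested, not relied upon.

```python
import html

def to_ncr(text):
    """Replace each non-whitespace character with a hexadecimal NCR (&#xXXXX;)."""
    return "".join(
        ch if ch.isspace() else "&#x%X;" % ord(ch)
        for ch in text
    )

# Two CJK Compatibility Ideographs, written as escapes so that this source
# text itself is not altered if it happens to be normalized.
sample = "\uF91D \uFA19"
encoded = to_ncr(sample)
print(encoded)

# Decoding the references restores the original code points.
decoded = html.unescape(encoded)
print(decoded == sample)
```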