Something fell between the cracks!

A peculiar series of events that took place on April 1st (no joke) and 2nd of this year led to the discovery of what can only be described as something of a revelation: a small number of CJK Compatibility Ideographs are necessary for China. This is important because I made the following statement on page 168 of CJKV Information Processing, Second Edition:



Event #1: My mobile phone, a Verizon Wireless–branded Samsung Galaxy S6 edge, received notification on the morning of April 1st that an OS update was available. Because the update was to Android OS Version 6.0.1 (aka Marshmallow), I immediately installed it.

Event #2: I navigated into the /system/fonts/ directory to see what new fonts were lurking inside of my mobile phone, and to my (pleasant) surprise, I noticed the following four Noto Sans CJK fonts: NotoSansJP-Regular.otf, NotoSansKR-Regular.otf, NotoSansSC-Regular.otf, and NotoSansTC-Regular.otf.

I also noticed a font called SECHans-Regular.otf, and when I inspected its tables, I discovered that it was a derivative of NotoSansSC-Regular.otf (the Adobe-branded equivalent is SourceHanSansCN-Regular.otf). Someone, presumably Samsung based on the “SEC” (Samsung Electronics Co., Ltd.) in its filename, added 18 glyphs and removed 100 others.

🤔

The following is an excerpt from the /system/etc/fallback_fonts.xml file:

<family>
  <fileset>
    <file lang="zh-Hans">SECHans-Regular.otf</file>
  </fileset>
</family>
<family>
  <fileset>
    <file lang="zh-Hant">NotoSansTC-Regular.otf</file>
  </fileset>
</family>
<family>
  <fileset>
    <file lang="ja">NotoSansJP-Regular.otf</file>
  </fileset>
</family>
<!-- Changing the priority of samsungkorean font to be higher than the Google Font, because font style -->
<family>
  <fileset>
    <file lang="ko">SamsungKorean-Regular.ttf</file>
  </fileset>
</family>
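
To make the fallback order in this excerpt easier to see, it can be read programmatically. The following is a minimal Python sketch, not a description of how Android itself consumes the file: the fragment is wrapped in a hypothetical <familyset> root element (an assumption made only so that it parses as well-formed XML), and each lang value is listed with its font file in document order.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the excerpt above.
EXCERPT = """
<family>
  <fileset>
    <file lang="zh-Hans">SECHans-Regular.otf</file>
  </fileset>
</family>
<family>
  <fileset>
    <file lang="zh-Hant">NotoSansTC-Regular.otf</file>
  </fileset>
</family>
<family>
  <fileset>
    <file lang="ja">NotoSansJP-Regular.otf</file>
  </fileset>
</family>
<!-- Changing the priority of samsungkorean font to be higher than the Google Font, because font style -->
<family>
  <fileset>
    <file lang="ko">SamsungKorean-Regular.ttf</file>
  </fileset>
</family>
"""


def fallback_order(fragment: str) -> list[tuple[str, str]]:
    """Return (lang, font file) pairs in the order they appear."""
    # The <familyset> wrapper is hypothetical; it only gives the fragment a single root.
    root = ET.fromstring(f"<familyset>{fragment}</familyset>")
    return [(f.get("lang", ""), (f.text or "").strip()) for f in root.iter("file")]


for lang, font in fallback_order(EXCERPT):
    print(f"{lang:8} {font}")
```

Read this way, the excerpt shows that Simplified Chinese is served by the Samsung-modified SECHans-Regular.otf, while Traditional Chinese and Japanese use the unmodified Noto Sans CJK fonts.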

Event #3: Being the font person that I am, I first analyzed the 100 glyphs that were removed, and all but 28 of them are for characters that have the UTR #51 property Emoji. Removing those glyphs makes sense for fonts that function in a font fallback environment. My guess is that the presence of these glyphs, or rather the mappings to these glyphs in the 'cmap' table, was interfering with the glyphs in the SamsungColorEmoji.ttf or NotoColorEmoji.ttf fonts that are also present in the /system/fonts/ directory (for reasons that should be obvious, the former font takes priority over the latter one on my device).
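
The comparison behind these counts boils down to set arithmetic on the code points that each font's 'cmap' table covers. The sketch below is a hedged illustration of that idea, not the tooling actually used for this analysis: the function names are hypothetical, and the three small sets are toy data standing in for the real fonts and for the UTR #51 Emoji property list.

```python
def cmap_delta(base: set[int], derived: set[int]) -> tuple[set[int], set[int]]:
    """Return (added, removed) code points of `derived` relative to `base`."""
    return derived - base, base - derived


def split_by_emoji(removed: set[int], emoji: set[int]) -> tuple[set[int], set[int]]:
    """Partition removed code points into (has Emoji property, does not)."""
    return removed & emoji, removed - emoji


# Toy stand-ins (assumptions for illustration only). With the fontTools package
# installed, real sets could be built with
# set(TTFont(path).getBestCmap()) for each font file.
base_cmap = {0x2605, 0x2614, 0x4E00, 0xF995}      # plays the role of NotoSansSC-Regular.otf
derived_cmap = {0x4E00, 0xF995, 0xFA0E}           # plays the role of SECHans-Regular.otf
emoji_property = {0x2614}                         # tiny excerpt of an Emoji property list

added, removed = cmap_delta(base_cmap, derived_cmap)
emoji_removed, other_removed = split_by_emoji(removed, emoji_property)

print("added:        ", sorted(f"U+{cp:04X}" for cp in added))
print("removed:      ", sorted(f"U+{cp:04X}" for cp in removed))
print("removed emoji:", sorted(f"U+{cp:04X}" for cp in emoji_removed))
print("removed other:", sorted(f"U+{cp:04X}" for cp in other_removed))
```

Run against the real fonts, the same two steps would yield the 18 added and 100 removed code points, with the removed set splitting into those with the Emoji property and the 28 without it.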

Despite 28 of the characters (★☉☏☗☜☞☟♀♁♂♩♪♫♬♭♮♯♲♳♴♵♶♷♸♹♺♼♽) not having the UTR #51 property Emoji, their glyphs in the SamsungColorEmoji.ttf font have been emoji-fied, which explains their removal from SECHans-Regular.otf.

Event #4: I then analyzed the 18 glyphs that were added, and this is where things became very, very interesting, at least for me.

After fully analyzing these 18 glyphs, I found that nine CJK Compatibility Ideographs, along with the 12 CJK Unified Ideographs that are in the CJK Compatibility Ideographs block, are necessary for GB 18030 support, and hence necessary for China. That makes 21 characters in total. Glyphs for three of the nine CJK Compatibility Ideographs (U+F995, U+FA0C, and U+FA0D) were already present in NotoSansSC-Regular.otf, which explains why only 18 glyphs were added.
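
The distinction between the two groups, and the arithmetic that ties them to the 18 added glyphs, can be checked with Python's standard unicodedata module. This is a minimal sketch under stated assumptions: only U+F995, U+FA0C, and U+FA0D are named explicitly in the text, so the other six code points in NINE are inferred from the characters shown in this post. The twelve CJK Unified Ideographs are found by looking for assigned code points in the block that have no decomposition, which is what separates them from true compatibility ideographs.

```python
import unicodedata

# The CJK Compatibility Ideographs block, U+F900 through U+FAFF.
BLOCK = range(0xF900, 0xFB00)

# Only U+F995, U+FA0C, and U+FA0D are named in the text; the other six
# are inferred from the characters shown in the post (an assumption).
NINE = [0xF92C, 0xF979, 0xF995, 0xF9E7, 0xF9F1, 0xFA0C, 0xFA0D, 0xFA18, 0xFA20]
ALREADY_PRESENT = {0xF995, 0xFA0C, 0xFA0D}


def unified_in_block() -> list[int]:
    """Assigned code points in the block that have no decomposition."""
    return [
        cp for cp in BLOCK
        if unicodedata.category(chr(cp)) != "Cn"          # skip unassigned
        and not unicodedata.decomposition(chr(cp))        # no mapping => unified
    ]


twelve = unified_in_block()

# Each of the nine does decompose to a CJK Unified Ideograph elsewhere.
for cp in NINE:
    print(f"U+{cp:04X} -> U+{unicodedata.decomposition(chr(cp))}")

needed = set(NINE) | set(twelve)
added = needed - ALREADY_PRESENT
print(len(twelve), len(needed), len(added))
```

The final line prints the three counts discussed above: the unified ideographs in the block, the total needed for GB 18030, and the number that had to be added.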

Actually, when I dug further, I determined that these 21 characters pre-date GB 18030, and were included in GBK. What’s interesting is that the Unihan Database lacks kIRG_GSource references for these 21 characters. In other words, these 21 characters fell between the proverbial cracks, which, in my experience, happens all too easily with the 12 CJK Unified Ideographs in the CJK Compatibility Ideographs block.

The image below shows the 9 CJK Compatibility Ideographs (郎凉秊裏隣兀嗀礼蘒) on the first line, and the 12 CJK Unified Ideographs in the CJK Compatibility Ideographs block (﨎﨏﨑﨓﨔﨟﨡﨣﨤﨧﨨﨩) on the second line, and with appropriate glyphs:

This points to the following actions that need to be executed:

  • China needs to submit a horizontal extension to add the 21 kIRG_GSource source references to the Unihan Database, and supply representative glyphs for the code charts. Although these characters pre-date GB 18030, they should use the “G9” source reference prefix that also corresponds to GBK, and being the helpful person that I am, I already prepared the new records.
  • China is apparently in the process of revising GB 18030, and it would be a Very Good Idea™ to reflect the Standardized Variants that correspond to the nine CJK Compatibility Ideographs.
  • I need to prepare an Adobe-GB1-x UVS (Unicode Variation Sequence) definition file for the Standardized Variants that correspond to the nine CJK Compatibility Ideographs. Actually, I already prepared the Adobe-GB1_sequences.txt file, which will be added to AFDKO and its sources.
  • In terms of the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK projects, most of these 21 characters will require a CN glyph, because the existing glyph is inappropriate for CN use. In addition, the region-specific subset definition for China needs to include CN glyphs for these 21 characters, and a UVS definition file for CN use also needs to be prepared, which will be derived from Adobe-GB1_sequences.txt. These actions are now on the radar for the Version 2.000 update.
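
One way to see why the Standardized Variants matter for the nine CJK Compatibility Ideographs is to normalize one of them. The following Python sketch uses U+FA0C, one of the code points named above, and shows that NFC normalization replaces it with its CJK Unified Ideograph, whereas a base character followed by a variation selector passes through unchanged. The choice of U+FE00 as the selector is for illustration only; the actual pairings are defined in the Unicode StandardizedVariants.txt data file and are not reproduced here.

```python
import unicodedata

compat = "\uFA0C"  # CJK COMPATIBILITY IDEOGRAPH-FA0C, named in the text
base = unicodedata.normalize("NFC", compat)

# Normalization silently replaces the compatibility ideograph.
print(f"U+{ord(compat):04X} normalizes to U+{ord(base):04X}")

# A variation sequence is the base character plus a variation selector.
# U+FE00 is used here for illustration only (an assumption, see above).
sequence = base + "\uFE00"
survives = unicodedata.normalize("NFC", sequence) == sequence
print("sequence survives NFC:", survives)
```

This is the practical reason for expressing these nine characters as variation sequences in a UVS definition file: the distinction then survives text that has been normalized.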

It is important to point out that these 21 kIRG_GSource source references have been missing from Unicode for over 20 years, at least when one considers that they originated from a standard that was published in 1995.

Better late than never, right?

🐡
