IVD Topic: Duplicate Sequence Identifiers

The IVD (Ideographic Variation Database) is all about ideograph variants. Up until earlier this year, its scope was limited to CJK Unified Ideographs, per UTS #37 (Unicode Ideographic Variation Database). Its scope now includes characters with the Ideographic property that are not canonically nor compatibly decomposable, which still excludes CJK Compatibility Ideographs.

In an ideal world, a particular glyph—whether it’s considered the standard (aka encoded) form or an unencoded variant of the standard form—would be associated with a single registered IVS (Ideographic Variation Sequence) within an IVD collection. However, we do not live in a perfect world, and several real-world conditions can lead to duplicate sequence identifiers within an IVD collection.

Duplicate sequence identifiers are explicitly listed in the IVD_Stats.txt file that is included in each version of the IVD.

What conditions lead to duplicate sequence identifiers?

One obvious condition that leads to duplicate sequence identifiers is when the glyph set on which the IVD collection is based maps more than one code point to the same glyph. The Adobe-Japan1 IVD collection includes two such cases, which are shown below:

Sequence Identifier Registered IVS Base Character
𩿎󠄀 CID+19071 𩿎󠄀 <29FCE E0100> 𩿎 U+29FCE
𩿗󠄀 CID+19071 𩿗󠄀 <29FD7 E0100> 𩿗 U+29FD7
櫸󠄀 CID+20152 櫸󠄀 <6AF8 E0100> 櫸 U+6AF8
𣟱󠄀 CID+20152 𣟱󠄀 <237F1 E0100> 𣟱 U+237F1

Another condition is when a particular glyph is considered—in error or otherwise—a variant of more than one ideograph. The Adobe-Japan1 IVD collection includes one such case, which is shown below:

Sequence Identifier Registered IVS Base Character
逹󠄁 CID+13912 逹󠄁 <9039 E0101> 逹 U+9039
達󠄁 CID+13912 達󠄁 <9054 E0101> 達 U+9054

A third condition that can lead to duplicate sequence identifiers is when a glyph is originally treated as a variant, but a subsequent version of Unicode adds a new ideograph that can map to the previously-unencoded glyph. It’s also possible that an earlier version of Unicode is found to include an ideograph that can map to the previously-unencoded glyph. Of course, there is no steadfast requirement that such cases require that a new IVS be registered, using the new ideograph as the base character. That decision is up to the IVD collection registrant. (We will cover this topic in the next section of this article.)

This particular condition occurred four times thus far for the Adobe-Japan1 IVD collection.

The first time was when a kanji that had been treated as a variant was found to be in CJK Unified Ideographs Extension B (Unicode Version 3.1; March 2001), and its glyph now maps from that code point:

Sequence Identifier As Variant Encoded
杞󠄁 CID+14140 杞 vs 杞󠄁 U+675E vs <675E E0101> 𣏌 U+233CC

The second time was in Unicode Version 5.2 (October 2009) when eight ideographs were appended to the URO (Unified Repertoire & Ordering), two of which now map to Adobe-Japan1-6 CIDs:

Sequence Identifier As Variant Encoded
梁󠄁 CID+14089 梁 vs 梁󠄁 U+6881 vs <6881 E0101> 鿄 U+9FC4
祓󠄁 CID+14168 祓 vs 祓󠄁 U+7953 vs <7953 E0101> 鿆 U+9FC6

The third time was in Unicode Version 6.0 (October 2010) when CJK Unified Ideographs Extension D was added, and eight of its 222 ideographs now map to Adobe-Japan1-6 CIDs:

Sequence Identifier As Variant Encoded
衞󠄁 CID+13651 衞 vs 衞󠄁 U+885E vs <885E E0101> 𫟘 U+2B7D8
𣘺󠄂 CID+13724 𣘺 vs 𣘺󠄂 U+2363A vs <2363A E0102> 𫞎 U+2B78E
今󠄁 CID+13780 今 vs 今󠄁 U+4ECA vs <4ECA E0101> 𫝆 U+2B746
勢󠄁 CID+13866 勢 vs 勢󠄁 U+52E2 vs <52E2 E0101> 𫝑 U+2B751
桺󠄁 CID+14064 桺 vs 桺󠄁 U+687A vs <687A E0101> 𫞉 U+2B789
座󠄁 CID+20114 座 vs 座󠄁 U+5EA7 vs <5EA7 E0101> 𫝶 U+2B776
菟󠄃 CID+20201 菟 vs 菟󠄃 U+83DF vs <83DF E0103> 𫟏 U+2B7CF
鐺󠄁 CID+20240 鐺 vs 鐺󠄁 U+943A vs <943A E0101> 𫟰 U+2B7F0

The fourth time is happening right now, and is being handled via PRI #349, which was initiated yesterday, and which is covered in the last section of this article. Unicode Version 10.0 is expected to be published in June of this year, which will add CJK Unified Ideographs Extension F, and two of its 7,473 ideographs can map to Adobe-Japan1-6 CIDs.

Anyway, speaking of IVD collections that include registered IVSes for all ideographs…

Should IVSes be registered for ideographs that lack variants?

The registered Adobe-Japan1 IVD collection includes registered IVSes for all Adobe-Japan1-6 kanji, regardless of whether a particular kanji is associated with one or more unencoded variant forms.

Looking back at the history of the IVD, specifically about the registration of the Adobe-Japan1 IVD collection, I discovered the following wisdom excerpt in an email from Eric Muller—the original co-author of UTS #37 and the original and previous IVD Registrar—that was dated 2006-03-02:

In my opinion, the mechanism of IVSes will be more useful to us if *every* (ideograph) CID is given a variation sequence, not just those CIDs that represent a character with multiple CIDs. Even if today, in AJ1-6, U+x corresponds to a single CID C+y, it still remains that C+y represents a smaller range of glyphs than U+x, and one may wish to record that smaller range in a document today, so that the document remains as intended tomorrow, when another CID for U+x shows up in AJ1-7. That being said, I’ll leave the final decision to you.

In other words, the purpose of registering IVSes for all ideographs in a glyph set is to preserve—or future-proof—the representation of a character in a document. And, as Eric wrote over ten years ago, the decision is ultimately up to the registrant of each IVD collection.

PRI #349

In terms of PRI #349, Registration of additional sequences in the Adobe-Japan1 collection, which was initiated on 2017-03-02, updated on 2017-04-25, and closes on 2017-06-02, the background is that three Adobe-Japan1-6 kanji, CIDs 13834, 14187, and 14226, were found to be present in CJK Unified Ideographs Extension F at U+2D544, U+2E278, and U+2E6EA, respectively. Two of these mappings were added to the Adobe-Japan1-6 Unicode CMap resources in the CMap Resources project on GitHub on 2016-09-03, and the third one on 2017-04-25, and I waited until it was approximately 90 days before the release of Unicode Version 10.0, which will include Extension F, to submit the proposed sequence identifiers to the IVD Registrar, who happens to be yours truly.

Sequence Identifier As Variant Encoded
小󠄂 CID+13834 小 vs 小󠄂 U+5C0F vs <5C0F E0102> 小󠄂 U+2D544
罐󠄁 CID+14187 罐 vs 罐󠄁 U+7F50 vs <7F50 E0101> 罐󠄁 U+2E278
蹊󠄁 CID+14226 蹊 vs 蹊󠄁 U+8E4A vs <8E4A E0101> 蹊󠄁 U+2E6EA

The image below shows the representative glyphs for the three proposed sequence identifiers, rendered using Kozuka Mincho (小塚明朝) and Kozuka Gothic (小塚ゴシック):


Comments are closed.