To UVS, Or Not To UVS

Several months ago I updated the Adobe-Japan1-UCS2 “ToUnicode” mapping file in the open source Mapping Resources for PDF project specifically to accommodate the two Adobe-Japan1-7 CIDs, CIDs 23058 and 23059.

However, that ToUnicode mapping file is long overdue for a rather extensive update for other reasons, and part of the delay was intentional on my part. The purpose of this article is to outline the reason for the delay, along with providing more concrete update plans.

The ToUnicode Mapping File

The purpose of the ToUnicode mapping file is to “derive content” from PDFs whose embedded fonts include glyphs that are referenced only by CIDs (if CID-keyed) or GIDs, and that do not already include an embedded ToUnicode mapping table. The premise is that a CID (or GID) is meaningless without knowing the code point from which it was originally mapped, or could have been mapped in the small number of ambiguous cases. Deriving content from PDFs is important because it allows text to be repurposed via copy and paste.

A ToUnicode mapping file does exactly what its name suggests: it maps CIDs to Unicode code points, or to code point sequences. Unlike CMap resources that map Unicode code points to CIDs, or 'cmap' tables that map code points to GIDs that may also be CIDs, a ToUnicode mapping file specifies the inverse mapping. Some omissions and ambiguities can arise, either because a glyph is represented as a sequence, or because it is mapped from multiple code points. An excellent example of the former is Adobe-Japan1-7 CID+16246 (ㇷ゚, which should not be confused with U+30D7 プ that corresponds to CID+979), which is not mapped in the Adobe-Japan1-7 Unicode CMap resources because it is represented as the sequence <U+31F7, U+309A> (CIDs 16243 and 16327). It is instead supported via the 'ccmp' (Glyph Composition/Decomposition) GSUB feature, which substitutes the same sequence:

substitute \16243 \16327 by \16246;

The Adobe-Japan1-UCS2 ToUnicode mapping file maps this CID to the following sequence (3f76 is the hexadecimal form of decimal 16246):

<3f76> <31f7309a>

An excellent example of the latter is Adobe-Japan1-7 CID+1200, which is mapped from U+2F00 ⼀ KANGXI RADICAL ONE and U+4E00 一 (a CJK Unified Ideograph). If CID+1200 is included in a PDF, one would naturally expect U+4E00 一 to be copied, not U+2F00 ⼀ as its use is more obscure. The Adobe-Japan1-UCS2 ToUnicode mapping file makes this mapping preference explicit (04b0 is the zero-padded hexadecimal form of decimal 1200):

<04b0> <4e00>
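
As a rough illustration, the two entries shown above can be decoded with a few lines of Python. This is only a sketch: it assumes that each entry is a pair of angle-bracketed hexadecimal strings exactly as shown in this article, and that the right-hand side is UTF-16BE. The function name parse_tounicode_line is made up for this example, and the surrounding syntax of the actual mapping file is not handled.

```python
def parse_tounicode_line(line):
    """Parse an entry such as '<3f76> <31f7309a>' into (CID, [code points])."""
    src, dst = line.split()
    cid = int(src.strip("<>"), 16)
    # Assumption: the destination is UTF-16BE, so fixed two-byte slicing is
    # avoided and Python's decoder handles any surrogate pairs.
    text = bytes.fromhex(dst.strip("<>")).decode("utf-16-be")
    return cid, [ord(ch) for ch in text]


for line in ("<3f76> <31f7309a>", "<04b0> <4e00>"):
    cid, code_points = parse_tounicode_line(line)
    print(cid, " ".join(f"U+{cp:04X}" for cp in code_points))
# 16246 U+31F7 U+309A
# 1200 U+4E00
```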

The primary client of ToUnicode mapping files is, of course, Adobe Acrobat. Other PDF-consuming apps can also make use of these mapping files.

As the contents of the pdf2unicode directory suggest, there are ToUnicode mapping files for our public ROSes (an abbreviation for Registry, Ordering, and Supplement, which is a fancy way of referring to our public CJK glyph sets), meaning Adobe-CNS1-7, Adobe-GB1-5, Adobe-Japan1-7, Adobe-Korea1-2 (though deprecated), and Adobe-KR-9.

Some apps, such as Adobe InDesign, embed a ToUnicode mapping table when exporting PDFs, but other apps, such as Adobe Illustrator, do not. The ToUnicode mapping files become critical when opening PDFs that are exported from the latter app and others like it.

The recent Adobe-Japan1-UCS2 ToUnicode mapping file update involved only CIDs 23058 and 23059, both of which map to U+32FF ㋿ SQUARE ERA NAME REIWA. This was done due to their expected high-profile nature. However, I have intentionally held off on updating this ToUnicode mapping file to accommodate other changes, along with corrections, because I was waiting for UVS (Unicode Variation Sequence) support to become more widespread, in both fonts that specify them in a Format 14 (Unicode Variation Sequences) 'cmap' subtable, and in OSes and apps.

Mapping to UVSes

The first ToUnicode mapping file to map CIDs to UVSes (Unicode Variation Sequences)—an umbrella term for Standardized Variation Sequences (SVSes), Ideographic Variation Sequences (IVSes), and Emoji Variation Sequences (EVSes)—is the one for Adobe-KR-9, Adobe-KR-UCS2, and does so for CIDs 22462 through 22479, which represent the glyphs that are associated with 18 KRName IVSes that are treated as non-default UVSes:

<57be> <537fdb40dd09>
<57bf> <5795db40dd02>
<57c0> <57cedb40dd05>
<57c1> <5abadb40dd04>
<57c2> <6210db40dd06>
<57c3> <665fdb40dd06>
<57c4> <6674db40dd05>
<57c5> <695edb40dd04>
<57c6> <6d77db40dd05>
<57c7> <76dbdb40dd05>
<57c8> <8056db40dd06>
<57c9> <83bddb40dd08>
<57ca> <865cdb40dd05>
<57cb> <8941db40dd04>
<57cc> <8aa0db40dd05>
<57cd> <8acbdb40dd05>
<57ce> <927cdb40dd04>
<57cf> <9f9cdb40dd07>
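
The right-hand values above are longer than four hexadecimal digits because the variation selector lies outside the BMP and is therefore written as a UTF-16 surrogate pair. The following Python sketch, which assumes UTF-16BE and uses the made-up helper name split_uvs, separates one such value into its base character and variation selector:

```python
def split_uvs(hex_value):
    """Split a UTF-16BE hex string for a UVS into (base, selector)."""
    text = bytes.fromhex(hex_value).decode("utf-16-be")
    # The surrogate pair db40 dd09 decodes to a single code point,
    # so exactly two characters are expected here.
    base, selector = (ord(ch) for ch in text)
    return base, selector


base, selector = split_uvs("537fdb40dd09")
print(f"<U+{base:04X}, U+{selector:04X}>")  # <U+537F, U+E0109>

# The supplementary selectors run from U+E0100 (VS17) to U+E01EF (VS256).
print("VS" + str(selector - 0xE0100 + 17))  # VS26
```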

So, how does that relate to updating the Adobe-Japan1-UCS2 ToUnicode mapping file? The following methodology that I plan to employ, at least for updating the mappings for the nearly 15K glyphs for kanji (aka ideographs), should explain:

CIDs that unambiguously map from a single CJK Unified Ideograph code point shall map to that code point. The vast majority of CIDs that correspond to kanji are covered by this.
CIDs that map from multiple CJK Unified Ideograph code points shall map to a preferred one.
CIDs that map from CJK Compatibility Ideograph code points shall map to the corresponding SVS.
All remaining CIDs should correspond to Adobe-Japan1 IVSes, and shall map to them.
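
One possible reading of the four rules above, sketched in Python. Everything here is an assumption made for illustration: the function and table names are hypothetical, the tables (preferred code points, SVSes, and IVSes) are supplied by the caller rather than taken from any real data file, and the code point ranges are block-level approximations. The twelve code points in the CJK Compatibility Ideographs block that are actually CJK Unified Ideographs are treated as unified.

```python
# Twelve code points in the compatibility block that are unified ideographs.
UNIFIED_IN_COMPAT_BLOCK = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}


def is_compat_ideograph(cp):
    if cp in UNIFIED_IN_COMPAT_BLOCK:
        return False
    return 0xF900 <= cp <= 0xFAFF or 0x2F800 <= cp <= 0x2FA1F


def is_unified_ideograph(cp):
    if cp in UNIFIED_IN_COMPAT_BLOCK:
        return True
    if is_compat_ideograph(cp):
        return False
    # Approximation: whole blocks and planes, including unassigned code points.
    return (0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF
            or 0x20000 <= cp <= 0x3FFFF)


def choose_target(cid, sources, preferred, svs_table, ivs_table):
    """Return the code point sequence that a kanji CID should map to.

    sources:   code points that map to this CID in the Unicode CMap resources
    preferred: hypothetical {cid: code point} table for ambiguous cases
    svs_table: hypothetical {compatibility code point: (base, selector)}
    ivs_table: hypothetical {cid: (base, selector)}
    """
    unified = [cp for cp in sources if is_unified_ideograph(cp)]
    if len(unified) == 1:          # rule 1: a single unified ideograph
        return (unified[0],)
    if len(unified) > 1:           # rule 2: several, so pick the preferred one
        return (preferred[cid],)
    compat = [cp for cp in sources if is_compat_ideograph(cp)]
    if compat:                     # rule 3: compatibility ideograph, use its SVS
        return svs_table[compat[0]]
    return ivs_table[cid]          # rule 4: fall back to the IVS


# CID+1200 is mapped from U+2F00 and U+4E00; only U+4E00 is a unified ideograph.
print(choose_target(1200, [0x2F00, 0x4E00], {}, {}, {}))
```

Note that under this reading, CID+1200 is resolved by the first rule, because the Kangxi radical is not a CJK Unified Ideograph and is filtered out before the candidates are counted.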

A small number of CIDs for non-kanji will also map to SVSes, such as the proportional, italic, and full-width forms of the slashed zero, along with the pre-rotated forms of the proportional and italic forms. The CIDs for the proportional and italic forms, along with their pre-rotated forms, will map to the sequence <U+0030, U+FE00>, and the CID for the full-width form will map to <U+FF10, U+FE00>.
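
To show what the right-hand side of such entries would look like, here is a small Python sketch that encodes a code point sequence as the lowercase UTF-16BE hexadecimal string used elsewhere in this article. The helper name to_hex is made up, and the CIDs themselves are omitted because they are not given above.

```python
def to_hex(code_points):
    """Encode a code point sequence as a lowercase UTF-16BE hex string."""
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-16-be").hex()


print(f"<{to_hex([0x0030, 0xFE00])}>")   # <0030fe00>
print(f"<{to_hex([0xFF10, 0xFE00])}>")   # <ff10fe00>

# A selector outside the BMP becomes a surrogate pair, as in Adobe-KR-UCS2.
print(f"<{to_hex([0x537F, 0xE0109])}>")  # <537fdb40dd09>
```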

For those who are concerned with mapping CIDs to UVSes, don’t be. The VS (Variation Selector) is default ignorable, meaning that if the consuming app does not support the variation sequence, the BC (Base Character) is displayed as-is, which is considered the ideal fallback for variation sequences. The VS will still be present, in case the UVS is repurposed in an environment that does support it.
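
The fallback can be modeled in a few lines of Python. This sketch only illustrates which character an app without UVS support would effectively display. It is not how any particular app is implemented, and the helper name is made up. It covers VS1 through VS16 (U+FE00 through U+FE0F) and VS17 through VS256 (U+E0100 through U+E01EF), but not the Mongolian free variation selectors.

```python
def strip_variation_selectors(text):
    """Return text with variation selectors removed, leaving base characters."""
    def is_vs(cp):
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
    return "".join(ch for ch in text if not is_vs(ord(ch)))


copied = "\u537f\U000E0109"   # <U+537F, U+E0109>, as copied from a PDF
shown = strip_variation_selectors(copied)
print(f"U+{ord(shown):04X}")  # U+537F

# The copied text itself still carries the selector.
print(len(copied))            # 2
```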

My new best friend, GitHub user t-tk, has been kindly suggesting, via Issue #6, some changes for the Adobe-Japan1-UCS2 ToUnicode mapping file, though some of them would result from performing the steps in the above list, or may be overridden by them. In any case, I genuinely value the feedback that he has been providing.

In closing, I hope to complete this project within the next month or so, but possibly sooner.

🐡

2 Responses to To UVS, Or Not To UVS

Mike Bremford says:

August 15, 2019 at 12:09 PM

Hi Ken – we are looking forward to this change too.

Is there any thought to updating the UniJIS-UTF16-H CMap as well? At present if you want to display a glyph in PDF that would result from a variation selector (say CID 13697, from U+3402 U+E0101), the only way to do this is to use an Identity map in the PDF: there is no mapping to that CID in the cid2code.txt.

I realise the “shortest code first” approach makes integrating variation selectors into the UTF16-H/V CMap difficult. Is this under consideration, or should those of us creating PDFs just use the Identity encoding for Japanese text?
- Dr. Ken Lunde says:
  
  August 16, 2019 at 8:33 AM
  
  Thank you for your comment and suggestion.
  
  Keep in mind that CMap resources map code points, not sequences, to CIDs. In other words, mapping an IVS, such as <U+3402 U+E0101>, to a CID, which would be CID+13697, is a complete non-starter in the context of Unicode CMap resources. It is simply not possible. IVSes, along with SVSes, are handled via the UVS definition file, which is used to create the Format 14 'cmap' subtable.
  
  The main purpose of the “ToUnicode” mapping resources, such as Adobe-Japan1-UCS2, is to derive content from embedded fonts whose glyphs are identified only by ROS and CIDs. In such resources, it is possible to map a CID to a sequence of code points.

CJK Type Blog

CJK Fonts, Character Sets & Encodings.